Introduction

In this paper I examine a fairly broad range of parallel queue and pool (unordered queue)
algorithms and present data indicating their relative strengths and weaknesses. The algorithms
have been implemented on the NYU Ultracomputer prototype, an eight-processor MIMD, shared
memory, single-bus machine. The data have been normalized to remove the effects of bus
contention. I discuss the implementation of a few of the basic synchronization primitives and the
implementation of each algorithm to be tested. Two programs were written. The first, rqtst,
provided a generalized arena for timing the different algorithms under different circumstances.
The second, fifo, was used to test the parallel FIFO-ness of the queues, that is, to test whether
the given algorithm truly represented a queue.


The NYU Ultracomputer

The Ultracomputer is an eight-processor, multiple instruction - multiple data (MIMD),
shared memory machine. Each processing element (PE) is connected to SMB of shared memory
through a single 32-bit bus. Each PE consists of a single Motorola 68010 microprocessor with 32
KB of private cache memory. The PEs perform long integer multiplication and division in
software. A PDP 11/34 is used to perform I/O for the Ultracomputer.

The basic primitive fetch-and-add, to be discussed shortly, is supported in hardware at the
shared memory end of the bus. All other fetch-and-phi operations (store and or) are implemented
in software through the use of fetch-and-add.

Dynamic shared memory allocation by spawned processes is not supported in the current
system. In the algorithms that required this, I simulated it by allocating enough memory before
spawning. This means, of course, that the total memory size was larger than it had to be in these
cases.

For more information on the Ultracomputer I refer you to references [7] and [8]. Reference
[7] in particular provides a very nice overview of software and hardware considerations concerning
the Ultracomputer.


What is a Parallel Queue?

Queues and pools are basic elements in any multi-programming system and are used in
many applications. They comprise an important and heavily used data structure in operating
systems and so should be relatively lightweight (in memory) and as fast as possible. There is, of
course, a trade-off between memory and speed. In a serial environment, speed is measured strictly
by the number of machine instructions executed on each insert and delete (this discussion assumes
the data structure is neither full nor empty). In a parallel environment, one must consider the
added time due to synchronization. For example, a linked list may be a fast serial algorithm, but
when synchronization is added to accommodate parallelism, a linked list becomes a relatively poor
queue. This is because the processes must serialize to gain access to the list. We will say
that an algorithm is highly parallel if the time it takes n processes to insert in parallel is the
same, to within a constant factor, as the time it takes one process to insert serially.

The term queue implies an ordering of the items in the queue corresponding to the order in
which each was inserted. What does it mean to be ordered when many processes are inserting
items onto the queue concurrently? To discuss parallel queues, then, we must first define what it
means to be ordered. Gottlieb, Lubachevsky and Rudolph (GLR), in [5], defined the characteristics
of a parallel queue as follows,

    If insertion of data item p is completed before insertion of another data item q is
    started, then it must not be possible for a deletion yielding q to complete before a
    deletion yielding p has started.

This implies that once an item is on (allocated a position in) the queue, it must be assigned to a
delete process before any later items in the queue are assigned a delete process. It does, however,
allow later items in the queue to be returned earlier than previous items on the queue: due
to scheduling, the process assigned to delete q may actually complete before the process assigned
to delete p.


Some Synchronization Routines

GLR discuss the various synchronization primitives used in parallel algorithms, the most
basic of which is the atomic operation fetch-and-add, also referred to as faa. The result of faa(i,a)
is to increment i by a and return the original value of i. This primitive is used to implement the
test-modify-retest paradigms tdr (test decrement retest) and tir (test increment retest), in which a
variable is tested to see if it will exceed some bound when either decremented or incremented,
respectively. A bound of zero is assumed for tdr. Their implementations are as follows,


boolean function tir(s,delta,bound)
    tir = false
    if (s + delta <= bound) then
        if (faa(s,delta) + delta <= bound) then
            tir = true
        else
            faa(s,-delta)
end tir


boolean function tdr(s,delta)
    tdr = false
    if (s - delta >= 0) then
        if (faa(s,-delta) - delta >= 0) then
            tdr = true
        else
            faa(s,delta)
end tdr


Race conditions and starvation, which are present in this implementation, are discussed in GLR.
Tdr and tir will be used extensively to test whether a queue/pool is empty or full, respectively,
and can therefore be deleted from or inserted into.
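
As an illustration, tir and tdr can be sketched in Python. This is a minimal sketch under stated
assumptions, not the Ultracomputer code: the Cell class is a hypothetical stand-in that simulates
the atomic faa with a lock, and the retest is written against the value that results from the faa.

```python
import threading


class Cell:
    """A shared integer with an atomic fetch-and-add (hypothetical helper)."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def faa(self, delta):
        """Add delta to the cell and return the value it held before."""
        with self._guard:
            old = self.value
            self.value = old + delta
            return old


def tir(s, delta, bound):
    """Test-increment-retest: add delta only if the result stays <= bound."""
    if s.value + delta <= bound:            # cheap initial test
        if s.faa(delta) + delta <= bound:   # retest on the value actually produced
            return True
        s.faa(-delta)                       # lost a race: undo the increment
    return False


def tdr(s, delta):
    """Test-decrement-retest: subtract delta only if the result stays >= 0."""
    if s.value - delta >= 0:
        if s.faa(-delta) - delta >= 0:
            return True
        s.faa(delta)                        # lost a race: undo the decrement
    return False
```

With a bound of 2, two calls to tir on an empty cell succeed and a third fails; two calls to tdr
then succeed and a third fails, leaving the cell at zero.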

The generalized counting semaphores P and V are used, most importantly, to obtain
exclusive access to a data structure, such as the linked list mentioned above. Each semaphore is
implemented as an integer that is initialized to a value that depends on the algorithm. The
following are the implementations of P and V,


procedure P(semaphore,delta)
    repeat until tdr(semaphore,delta)
end P


procedure V(semaphore,delta)
    faa(semaphore,delta)
end V


In the linked list example, the integer would be given the initial value one. The list would be
obtained using P(semaphore,1) and released with V(semaphore,1).
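
The use of P and V as a binary lock can be sketched in Python as follows. The sketch rests on
assumptions: Cell and run_workers are hypothetical names, faa is simulated with a lock, V is
taken to add delta back to the semaphore, and the busy-wait yields with time.sleep(0) only so
that Python threads make progress.

```python
import threading
import time


class Cell:
    """A shared integer with a simulated atomic fetch-and-add."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()

    def faa(self, delta):
        with self._guard:
            old = self.value
            self.value = old + delta
            return old


def tdr(s, delta):
    """Test-decrement-retest with a bound of zero."""
    if s.value - delta >= 0:
        if s.faa(-delta) - delta >= 0:
            return True
        s.faa(delta)
    return False


def P(semaphore, delta):
    """Busy-wait until delta units can be taken from the semaphore."""
    while not tdr(semaphore, delta):
        time.sleep(0)  # yield to other threads while spinning


def V(semaphore, delta):
    """Return delta units to the semaphore."""
    semaphore.faa(delta)


def run_workers(nworkers, rounds):
    """Run threads that enter a critical section guarded by P/V."""
    lock = Cell(1)  # initial value one: at most one holder at a time
    state = {"inside": 0, "max_inside": 0, "total": 0}

    def worker():
        for _ in range(rounds):
            P(lock, 1)
            state["inside"] += 1
            state["max_inside"] = max(state["max_inside"], state["inside"])
            state["total"] += 1
            state["inside"] -= 1
            V(lock, 1)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"], state["max_inside"], lock.value
```

The second value returned by run_workers records the largest number of threads seen inside the
critical section at once, which should never exceed one.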

The remaining three synchronization primitives are more involved, so I will describe only
their semantics.

The first is the group lock synchronization algorithm developed by Dimitrovsky in [6]. The
following set of routines

    glock(lock)/gunlock(lock) - enter or leave a group

    gsync(lock) - synchronize the group

provides for the arbitrary grouping of processes at a boundary (glock or gsync). Glock acts like a
waiting room in which processes wait (busy-wait) for a door to open, at which point they all go
through the door. The door will not open until the previous group of processes has exited
through gunlock. Gsync is used between the glock and gunlock calls to resynchronize the grouped
processes.

The second is the reader-writer lock. Here a lock is associated with a data structure that
can be read by any number of processes when no process is writing (or waiting to write) to the
structure. At any time any reader may become a writer, at which time all requests for read locks
are denied and forced to busy-wait until the writer has released its lock. The new writer may not
proceed until all current readers have released their locks, and only one writer may exist at any
given time. A naive use of a reader-writer lock in queues would be to avoid integer overflow
resulting from faa on position counters in the queue. When the counter reaches some critical
value, a reader lock is upgraded to a writer lock, which waits for all current readers to finish,
and then modifies the counter. Note that this need not be done with a faa, since all other
modifiers of the counter have been locked out.

The set of required reader-writer routines is as follows.

    rw_initlock(lock) - initialize a reader-writer lock

    rw_rlock(lock)/rw_runlock(lock) - obtain/release a reader-writer read lock

    rw_wlock(lock)/rw_wunlock(lock) - obtain/release a reader-writer write lock

    read2write(lock) - change from a read lock to a write lock

    write2read(lock) - change from a write lock to a read lock

Their implementations can be found in GLR, [10], [1] and [2].

The last is the reader-reader lock, in which there are two groups of processes, A and B, that
cannot access a data structure at the same time. That is, either group A or group B can access
the data structure, but not both concurrently. One group is arbitrarily given priority, which can
lead to starvation. A reader-reader lock is useful in dynamic hash queues in which the size of the
hash table is allowed to shrink and expand. One would not want both an expand and a shrink
taking place simultaneously, so a reader-reader lock is used in which group A are the expanders
and group B the shrinkers.

The set of required reader-reader routines is as follows.

    rr_initlock(lock) - initialize a reader-reader lock

    rr_Alock(lock)/rr_Aunlock(lock) - obtain/release a reader-reader A lock

    rr_Block(lock)/rr_Bunlock(lock) - obtain/release a reader-reader B lock

Their implementations can be found in [9], [1] and [2].


Queue/Pool Overview

The next few sections describe the algorithms that were tested. I treat some of the older ones
less thoroughly and go into more depth with the newer ones. There are three classes of
algorithms. The singleton queues are FIFO queues that do not support multiple item insertion,
while the second type, multi-queues, do support this feature. The third type, which is not a queue
but rather a grouping of items, I refer to as a pool. All algorithms use busy-waiting
synchronization primitives as opposed to the non-busy-waiting (or blocking) type.

I define the following variables in order to quantify the various algorithms' performance.

    m - is the number of queues.
    M - is the maximum number of queues that can be supported.
    n - is the number of items that may appear on all m queues at one time (not counting
        multiplicities).
    N - is the maximum number of items that can be supported on one queue.
    p - is the amount of parallelism to be supported.


I have used these quantities to define each algorithm's memory and time requirements in Table 1
below. Where a quantity is not defined above, see the description of the algorithm.


                                                     Time
Queue         Memory                       Insert              Delete

llist         O(m + n)                     O(1)                O(1)
queue         O(p x m + n)                 O(1)                O(1)
hqueue (1)    O(m + n + hsize)             O(1)                O(1 + n/hsize)
dhqueue (2)   O(m + n + hsize(n))          O(2 + 2 x MaxLF)    O(2 + 3 x MinLF/2)
fqueue        O(m x n)                     O(1)                O(1)
vqueue        O(m x n)                     O(1)                O(1)

mqueue (3)    O(p x m + n)                 O(1)                O(1)
tqueue (4)    O(n + m x (log_k n + 1))     O(1 + log_k n)      O(1 + log_k n)

set           O(m x n)                     O(1)                [illegible]
cset          O(m x n/wordsize)            O(1)                [illegible]
pool          O(p x m + n)                 O(1)                [illegible]

(1) hsize is the size of the hash table.

(2) Time complexities shown are worst case. n/MaxLF <= hsize <= n/MinLF.

(3) Performance depends upon the multiplicity (c) of the items. When p < c, time dependency
    approaches O(1), but in the worst case where c = 1, the time dependency is O(p) on average.
    This is because the processes must serialize to obtain successive elements.

(4) The number of children of each node in the tree is variable. For the standard tests,
    k = [illegible].

Table 1.


The time complexities shown in the table represent best-case times, with the exception of those
for dhqueue. I have assumed that the queue/pool is not full when inserting or empty when
deleting, and that no process has to wait for an item to be inserted or deleted (i.e., no cell
contention).

Each queue/pool maintains a qsize counter to indicate the number of items currently on the
queue. On insertion, either tir(qsize,nitems,maxqsize) or faa(qsize,nitems) is performed. The tir
is used when the queue is of fixed size; if it fails, insertion fails. Except in the case of
multi-queues, the value of nitems will be one. On deletion, each queue or pool performs a
tdr(qsize,1) to see if there are any items available. If the tdr fails, deletion fails.

Each algorithm uses head and tail counters as indices into the queue/pool to delete from or
insert on to, respectively. Eventually, these counters will overflow if no action is taken to
prevent it. In some cases (the hash queues, for example), overflow is not a problem. Overflow
only causes a discontinuity in the indices; some queues/pools are more sensitive to this
discontinuity than others and therefore may require an adjustment of head and tail as the data
structures age. In the algorithms that cannot handle overflow, a check is generally made on each
faa of head or tail to see if the value has reached some predetermined bound. If so, the counter
is decremented by bound with a faa.
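
The bound check can be sketched in Python as follows. This is an illustration only, with
assumptions that are not in the text: Cell and next_slot are hypothetical names, faa is simulated
with a lock, the counter is used purely as an index modulo the array size, and bound is taken to
be a multiple of that size so that the decrement does not disturb the sequence of slots.

```python
import threading


class Cell:
    """A shared integer with a simulated atomic fetch-and-add."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()

    def faa(self, delta):
        with self._guard:
            old = self.value
            self.value = old + delta
            return old


def next_slot(counter, size, bound):
    """Take the next index from counter, pulling it back before it overflows.

    Assumes bound is a multiple of size, so indices taken before and after
    the adjustment map to the same cyclic sequence of slots.
    """
    ticket = counter.faa(1)
    if ticket + 1 == bound:   # this caller moved the counter onto the bound
        counter.faa(-bound)   # decrement by bound with another faa
    return ticket % size
```

For example, with size 4 and bound 8 the counter never settles above 7, while the slots returned
continue to cycle 0, 1, 2, 3.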

All of the queues/pools, with the exception of the fixed-size singleton queues/pools, provide
the ability to prematurely remove an item from the data structure regardless of its position
therein. In all cases this involves leaving what is called a "hole", a temporary placeholder that,
when encountered, will be deleted from the structure but will not be returned as a deleted item.
Rather, deletion continues through the data structure until the first real item is found. It is
necessary to leave this hole in most cases because each item, once it is put on the queue, is
assumed to be there and associated with a given position in the queue/pool. A placeholder must
therefore be left in place of removed items. I have not attempted to test the various methods used
for removal, but I mention this aspect in order to be complete.

The pseudo-code presented below is meant to give the reader the basic concept of each
algorithm. I do not show the full details because, as one might expect, the code would become less
readable and therefore less useful. As a result, I have left out, among other things, overflow
handling and support for interior removal. Except in the case of mqueue, for which the extra code
is included, the codes should not require added synchronization to support interior removal. The
code presented here is a good representation of that which was tested.


Singleton Queues


Llist

As mentioned earlier, a single linked list is not a very useful queuing algorithm in a parallel
environment. Distributed linked lists are the basis of the algorithms of queue, hqueue, dhqueue
and mqueue, however, so I will mention its implementation here. It will also be useful as a
reference when comparing other algorithms' performance.

Through the use of P/V, a binary lock (semaphore) is held over the list so that only one
process can access the items on the list at once. This serializes all processes accessing the list,
and for this reason llist will not support highly parallel programming.


Sllist

The serial linked list was added to cqtst strictly as a reference. The code is identical to that
of llist except that the synchronization primitives have been removed. This will provide us with a
feel for the relative speeds of the other algorithms, and possibly an idea of how much
synchronization costs.


Queue

Queue is an implementation of a GLR-type queue of unlimited size supporting interior
removal. A circular array of linked lists is used to hold the queue items. The array size is user
definable, but should be chosen to be larger than the maximum amount of parallelism. A faa is
performed on a counter, head (for delete) or tail (for insert), that is used to index into the array
of lists. A binary lock is then obtained on the desired list and the item is either deleted or
inserted. Since items are placed at the end of each list, and remembering the circular nature of
the array, this queue can be thought of as a spiral. Items are inserted at the outside and
eventually spiral down to the center where they are deleted. The pseudo-code for queue follows.


Queue

Initialize(q,parallelism)         Insert(q,item)                        Delete(q)

q.cnt := 0                        mytail := faa(q.tail,1)               if (not tdr(q.cnt,1))
q.tail := 0                       mylist := mytail mod q.size             return(queue empty)
q.head := 0                       list := (address of)q.lists[mylist]   myhead := faa(q.head,1)
q.size := parallelism             repeat                                mylist := myhead mod q.size
for i := 0 to q.size-1 do begin   until (list->putticket = mytail)      list := (address of)q.lists[mylist]
  q.lists[i].cnt := 0             P(list->lock,1)                       repeat
  q.lists[i].putticket := i         - put item at end of list           until (list->getticket = myhead)
  q.lists[i].getticket := i       V(list->lock,1)                       repeat
  PVinit(q.lists[i].lock,1)       faa(list->putticket,q.size)           until (tdr(list->cnt,1))
endfor                            faa(list->cnt,1)                      P(list->lock,1)
                                  faa(q.cnt,1)                            - get item from head of list
                                                                        V(list->lock,1)
                                                                        faa(list->getticket,q.size)

The time complexity of this queue is O(1) for both deletion and insertion. It takes constant
time to find the correct list, and with the correct pointers to the beginning and end of each list,
list manipulation is also constant time. The memory complexity is O(m*p + n). It is likely that
within an operating system m, the number of queues, will rise at least linearly with p, the amount
of available parallelism. This means that the memory requirements grow quadratically with p.
This is an important result and deserves close attention, as some of the other queue
algorithms have more modest memory requirements.
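
The insert and delete paths of queue can be sketched in Python as follows. The sketch is
illustrative only and makes several assumptions: SpiralQueue, SubList and Cell are hypothetical
names, faa is simulated with a lock, a threading.Lock stands in for the binary P/V semaphore, a
deque stands in for each linked list, and overflow handling and interior removal are left out.

```python
import threading
from collections import deque


class Cell:
    """A shared integer with a simulated atomic fetch-and-add."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()

    def faa(self, delta):
        with self._guard:
            old = self.value
            self.value = old + delta
            return old


def tdr(s, delta):
    """Test-decrement-retest with a bound of zero."""
    if s.value - delta >= 0:
        if s.faa(-delta) - delta >= 0:
            return True
        s.faa(delta)
    return False


class SubList:
    """One list of the circular array, with its own tickets and lock."""

    def __init__(self, i):
        self.cnt = Cell(0)
        self.putticket = Cell(i)
        self.getticket = Cell(i)
        self.lock = threading.Lock()  # plays the role of P/V on list->lock
        self.items = deque()


class SpiralQueue:
    def __init__(self, parallelism):
        self.cnt, self.tail, self.head = Cell(0), Cell(0), Cell(0)
        self.size = parallelism
        self.lists = [SubList(i) for i in range(parallelism)]

    def insert(self, item):
        mytail = self.tail.faa(1)
        lst = self.lists[mytail % self.size]
        while lst.putticket.value != mytail:  # wait for our turn on this list
            pass
        with lst.lock:
            lst.items.append(item)            # put item at end of list
        lst.putticket.faa(self.size)
        lst.cnt.faa(1)
        self.cnt.faa(1)

    def delete(self):
        if not tdr(self.cnt, 1):
            return None                       # queue empty
        myhead = self.head.faa(1)
        lst = self.lists[myhead % self.size]
        while lst.getticket.value != myhead:  # wait for our turn on this list
            pass
        while not tdr(lst.cnt, 1):            # wait for the matching insert
            pass
        with lst.lock:
            item = lst.items.popleft()        # get item from head of list
        lst.getticket.faa(self.size)
        return item
```

Used from a single thread, the sketch returns items in insertion order and reports an empty queue
by returning None.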


Hqueue/Dhqueue Commonalities

Hqueue and dhqueue are two hash queue algorithms that are very similar to queue
except that they can support many queues on one hash table (circular array). Like queue, these
allow for an unlimited size queue and interior removal. The hash queues differ from queue in that
each item on each queue is tagged with its position in the given queue. This avoids the need to
keep the lists ordered, but requires deletes to search the list for the correct item. Dhqueue
differs from hqueue in that it utilizes a dynamic hash table that can both expand and contract.
Because these are new algorithms (hqueue is attributed to Isaac Dimitrovsky in reference [1]), I
will discuss them in more detail than the previous algorithms.

Each queue in a hash table is given a unique identifier mapped from the range 1..M+1,
where M is the maximum number of queues that may use the table at once. The extra value is
used to identify an internal queue of independent queue items that may be exchanged for a
removed item. These queue identifiers are used in the hash function to distinguish similar items
in different queues and to keep queues from tracking one another. Tracking occurs if the ith
element of queue q1 and the jth element of queue q2 fall in the same bucket, and the (i+1)th and
(j+1)th elements of queues q1 and q2, respectively, also fall in the same bucket. The generalized
hash function used is as follows,

    hashval = (qid * pos) mod hsize,

where pos is obtained from a faa on either head or tail, accordingly. The hash function for
dhqueue, to be discussed shortly, is slightly different but uses a similar concept to avoid tracking.
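
The effect of including the queue identifier in the hash can be seen in a small Python sketch.
The function names and the table size of 7 are chosen only for illustration; the sketch compares
elements at equal positions, which is a special case of the tracking condition.

```python
def hashval(qid, pos, hsize):
    """The generalized hash function: (qid * pos) mod hsize."""
    return (qid * pos) % hsize


def bucket_sequence(qid, count, hsize):
    """Buckets visited by the first count positions of queue qid."""
    return [hashval(qid, pos, hsize) for pos in range(count)]


def same_position_collisions(qid_a, qid_b, count, hsize):
    """Number of positions at which two queues land in the same bucket."""
    a = bucket_sequence(qid_a, count, hsize)
    b = bucket_sequence(qid_b, count, hsize)
    return sum(1 for x, y in zip(a, b) if x == y)
```

With hsize = 7, queue 2 visits buckets 0, 2, 4, 6, 1, 3, 5 and queue 3 visits 0, 3, 6, 2, 5, 1, 4.
Because each queue steps through the table with its own stride, the two meet only at position 0
and do not follow one another afterwards.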

Along with the notion of queue identifiers (ids) comes the need to keep track of the
available ids. In other algorithms, to reuse a queue one can just re-initialize it, but with these
two algorithms the queue ids must be made available again through a separate function called
dispose. Disposing of a queue mainly involves putting the id back in a set that is associated
with the hash table. The current algorithm uses the cset algorithm, which will be discussed
shortly. Queue initialization, then, involves labeling the queue with an id obtained from the set.

Interior removal is handled in a unique way. A separate internal queue is maintained whose
items are passed back as copies of the removed items. Upon queue initialization the user
indicates the number of removals (holes) that the queue will support. Only one internal queue is
used to support the removals of all the queues, however, and as a result a queue may get more
or fewer removals than were requested at initialization. This scheme is a little more fair than the
one used in queue, which requires a doubly-linked list and a counter in each item indicating the
number of times an item was removed immediately after it in its linked list (this was not shown in
the pseudo-code for queue). One pays for this even if removes aren't required, whereas in the
(d)hqueue implementation, one only suffers memory and time penalties when actually using this
feature.


Hqueue

Predominantly, the algorithm discussed above is hqueue. There are a few differences
between it and dhqueue that were not mentioned, though. A performance-critical difference is
that an hqueue deletion searches for the desired item in the bucket outside of the critical P/V
section. This is in contrast to dhqueue and should provide better efficiency. Hash table
initialization requires the table's size (i.e., the table is of fixed size). Since the hash function
takes its result modulo the table size, the table size should be prime. The pseudo-code is shown
below.


Hqueue

Initialize(q)               Insert(q,item)                        Delete(q)

q.cnt := 0                  pos := faa(q.tail,1)                  if (not tdr(q.cnt,1))
q.tail := 0                 mylist := hash(q,pos)                   return(queue empty)
q.head := 0                 list := (address of)q.lists[mylist]   pos := faa(q.head,1)
q.id := cset_delete(cset)   - mark item with q.id and pos         mylist := hash(q,pos)
                            P(list->lock,1)                       list := (address of)q.lists[mylist]
                              - put item on list                  repeat
                            V(list->lock,1)                         - search list for item with
                            faa(q.cnt,1)                              (i.qid = q.id and i.pos = pos)
                                                                  until (item is found)
                                                                  P(list->lock,1)
                                                                    - get item from list
                                                                  V(list->lock,1)

Memory requirements are O(hsize + M + n), where hsize is the size of the hash table, and n is
the number of items on at most M queues using the hash table. Notice that the complexity is
linear in the number of queues and/or processors, assuming M = c*p. Insertion is O(1) time, but
deletion becomes O(1 + n/hsize) since each operation requires searching a list for an item from
the designated queue.
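
A compact Python sketch of the hqueue scheme is given below. It is an illustration under
assumptions: HashTable, HQueue and Cell are hypothetical names, queue ids are passed in by the
caller rather than drawn from a cset, faa is simulated with a lock, Python lists stand in for the
bucket lists, and a threading.Lock stands in for P/V on each bucket.

```python
import threading


class Cell:
    """A shared integer with a simulated atomic fetch-and-add."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()

    def faa(self, delta):
        with self._guard:
            old = self.value
            self.value = old + delta
            return old


def tdr(s, delta):
    """Test-decrement-retest with a bound of zero."""
    if s.value - delta >= 0:
        if s.faa(-delta) - delta >= 0:
            return True
        s.faa(delta)
    return False


class HashTable:
    """Fixed-size table of buckets shared by several queues."""

    def __init__(self, hsize):
        self.hsize = hsize
        self.buckets = [[] for _ in range(hsize)]
        self.locks = [threading.Lock() for _ in range(hsize)]


class HQueue:
    def __init__(self, table, qid):
        self.table, self.id = table, qid
        self.cnt, self.tail, self.head = Cell(0), Cell(0), Cell(0)

    def _hash(self, pos):
        return (self.id * pos) % self.table.hsize

    def insert(self, item):
        pos = self.tail.faa(1)
        b = self._hash(pos)
        with self.table.locks[b]:
            # the item is tagged with the queue id and its position
            self.table.buckets[b].append((self.id, pos, item))
        self.cnt.faa(1)

    def delete(self):
        if not tdr(self.cnt, 1):
            return None                      # queue empty
        pos = self.head.faa(1)
        b = self._hash(pos)
        bucket = self.table.buckets[b]
        entry = None
        while entry is None:                 # search outside the critical section
            for e in list(bucket):           # snapshot of the bucket contents
                if e[0] == self.id and e[1] == pos:
                    entry = e
                    break
        with self.table.locks[b]:
            bucket.remove(entry)             # get item from list
        return entry[2]
```

Two queues sharing one table of size 7 each return their own items in FIFO order, even when both
place items in the same bucket.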


Dhqueue

Dhqueue, as mentioned above, utilizes a hash table that can grow and shrink as the load on
the table changes. The algorithm used is a merging of the Dimitrovsky hash queue above and a
serial dynamic hash table algorithm discussed in reference [4]. The serial dynamic algorithm is
based on a linear expansion method, in which each bucket in the table is split or merged once
per cycle. A cycle is the number of splits/merges required to double/halve the table size. A
counter p, which indicates the next bucket to be split/merged, is initialized to zero and
incremented/decremented with each split/merge. When a bucket is split/merged, a new/old bucket
with index p+maxp is used as a buddy. Maxp is the maximum value that p can attain before it is
reset to zero and the table begins its next phase of doubling. When p is incremented to maxp, p
is reset to zero and the value of maxp is doubled. When p is decremented to zero, maxp is divided
by two and p is set to the resulting value. A compile-time constant, a power of 2, is chosen as
the minimum table size, so that the value of maxp is always a power of 2.
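
The evolution of p and maxp can be traced with a short serial Python sketch. It is illustrative
only: SplitState is a hypothetical name, no items are moved, and the rule for shrinking is
applied just before a merge rather than just after the preceding one, which yields the same
sequence of merged buckets.

```python
class SplitState:
    """Serial bookkeeping for linear expansion: the counters p and maxp."""

    def __init__(self, min_size):
        self.min_size = min_size  # minimum table size, a power of 2
        self.p = 0                # next bucket to be split
        self.maxp = min_size      # value at which p wraps

    def size(self):
        """Number of buckets currently in the table."""
        return self.maxp + self.p

    def split(self):
        """Split bucket p; return (bucket, buddy)."""
        pair = (self.p, self.p + self.maxp)
        self.p += 1
        if self.p == self.maxp:   # cycle complete: the table has doubled
            self.p = 0
            self.maxp *= 2
        return pair

    def merge(self):
        """Undo the most recent split; return (bucket, buddy) or None."""
        if self.p == 0:
            if self.maxp == self.min_size:
                return None       # already at the minimum table size
            self.maxp //= 2       # start undoing the previous cycle
            self.p = self.maxp
        self.p -= 1
        return (self.p, self.p + self.maxp)
```

Starting from a minimum size of 2, three splits pair buckets (0,2), (1,3) and (0,4), and three
merges undo them in reverse order.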

Memory is allocated in segments, each a group of buckets. The segments double in size, up
to some maximum, with each allocation. This was not the only strategy available, but it
seems reasonable to always allocate space in proportion to the current size of the table.

The hash function is similar to that of hqueue, using the queue identifier to eliminate
tracking, but is also dependent on the values of p and maxp, as is shown in the pseudo-code in the
table below. If the first value of hashval is less than p, the corresponding bucket has been split.
If this is the case, the queue item may go in either of two buckets, corresponding to the first
value of hashval or its buddy at hashval + maxp.
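
The two-step hash can be written as a short Python function. This is a sketch in which p and
maxp are passed in explicitly instead of being read from the hash table, and the name dh_hash is
hypothetical.

```python
def dh_hash(qid, pos, p, maxp):
    """Hash into a table that is part-way through a doubling cycle.

    Buckets 0..p-1 have already been split this cycle, so an item that
    first hashes below p is rehashed modulo 2*maxp, which selects either
    the original bucket or its buddy at hashval + maxp.
    """
    tmp = qid * pos
    h = tmp % maxp
    if h < p:
        h = tmp % (2 * maxp)
    return h
```

For qid = 3, maxp = 4 and p = 2, positions 0 through 5 map to buckets 0, 3, 2, 1, 4, 3. Position 4
first hashes to bucket 0, which has been split, and is redirected to the buddy bucket 4.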

The value of maxp is a power of 2, which makes for a convenient modulo operation, but
generally provides for a horrendous hash function. If qid is even, hashval will always be even and
every other bucket will go unused. On the other hand, if qid is odd, then since the values of pos
are sequential, tmp will be alternately odd and even. This results in hashval exhibiting the same
behavior. Each odd value of qid less than the smallest value of maxp (the minimum table size) will
result in a different sequence of hash values over the range (0..maxp-1). This provides a well
distributed hash function without any one-to-one tracking. There is a behavior I'll call cyclic
tracking that results for the odd numbers 3 and 9, for example. With a table size greater than 8,
the frequency of such tracking is only 3, that is, every third element of queue 3 collides with
every element of queue 9. Another problem with the odd queue identifier scheme is that it limits
the number of queues on a hash table if one-to-one tracking is to be avoided completely. It does
seem plausible to make the minimum table size dependent on the expected maximum number of
queues, although that was not implemented in the current version.
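
These properties are easy to check numerically. The following Python sketch uses a table size of
16 purely as an example, and the function names are hypothetical.

```python
def bucket_sequence(qid, count, maxp):
    """Buckets visited by the first count positions of queue qid."""
    return [(qid * pos) % maxp for pos in range(count)]


def used_buckets(qid, maxp):
    """Sorted list of distinct buckets reached over one full range of pos."""
    return sorted(set(bucket_sequence(qid, maxp, maxp)))
```

With maxp = 16, an even identifier such as 6 reaches only the eight even buckets, while the odd
identifier 3 reaches all sixteen. Taking every third element of queue 3 reproduces the bucket
sequence of queue 9 exactly, which is the cyclic tracking described above.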

The synchronization used on an insert and delete is similar to that of hqueue, with two
exceptions. Between the time the bucket for the nth item is determined and the time the lock on
that bucket is obtained, the bucket may have been split or merged. This could result in the item
not being found on a delete or an item being misplaced on an insert. The solution for inserts is
to check that the bucket number hasn't changed once the lock is obtained; if it has, release the
lock and rehash. The solution for deletes, as mentioned earlier, is to search the list once per
lock on a bucket. If the item is not found, release the lock, rehash and obtain a new lock,
repeating until the item is found.

To simplify the discussion of expansion/contraction I will for the most part refer to table
expansion; the extension to contraction is relatively clear. Upon initiation of an expansion,
a read lock is obtained. A faa(p,1) is performed, and if a process, call it A, receives maxp-1,
its lock is upgraded to a write lock. After obtaining the write lock, process A resets p to zero
and doubles the value of maxp. Processes entering after A that obtain p >= maxp release their
read locks, wait for maxp to be reset, and get a new read lock and a new value of p. This
synchronization ensures that one full cycle of expansion is complete (all buckets in 0..maxp-1
are split) before the next cycle can begin.

Each process now has a unique value of p and p' = p + maxp, where p is the bucket number
to be split and p' is the bucket to split items into. After locking bucket p, each item is
rehashed, and if it should be moved it is put on a linked list external to the queue. The lock on
bucket p is released and the resulting list is then put into bucket p' after obtaining a lock
there.
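
The splitting of a single bucket can be sketched in Python as follows. The sketch is
illustrative and rests on assumptions: split_bucket is a hypothetical name, items are stored as
(qid, pos, item) tuples in Python lists, each bucket has its own threading.Lock in place of P/V,
and the reader-writer and reader-reader locking around the expansion is omitted.

```python
import threading


def split_bucket(buckets, locks, p, maxp):
    """Split bucket p, moving items that now hash to its buddy p + maxp.

    Returns the number of items moved.
    """
    buddy = p + maxp
    moved = []                           # list held outside the table
    with locks[p]:
        stay = []
        for qid, pos, item in buckets[p]:
            if (qid * pos) % (2 * maxp) == p:
                stay.append((qid, pos, item))
            else:
                moved.append((qid, pos, item))
        buckets[p] = stay
    with locks[buddy]:                   # bucket p is already unlocked here
        buckets[buddy].extend(moved)
    return len(moved)
```

For maxp = 4, a bucket 1 holding positions 1, 5, 9 and 13 of queue 1 keeps positions 1 and 9 and
hands positions 5 and 13 to bucket 5.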

To exclude expansions from being concurrent with contractions, a reader-reader lock is used
around the calls to each routine. This ensures that while a bucket is being split it is not also
being merged. More importantly, this ensures that p will be either increasing or decreasing, but
not both, during a table size change, so that the hash function is reliable. The following is
pseudo-code for dhqueue.


Dhqueue

Initialize(htab)
    htab.cnt := 0
    htab.size := MinTabSize
    htab.qids := csetinit(cset,max_queues)
    rw_initlock(htab.rwlock)
    rr_initlock(htab.rrlock)
    for i := 0 to MinTabSize-1 do
        initialize_list(htab.lists[i])

Initialize(q)
    q.cnt := 0
    q.tail := 0
    q.head := 0
    q.id := 2*cset_delete(cset)-1

Hash(q,pos)
    tmp := q.id * pos
    p := q.htab.p
    maxp := q.htab.maxp
    hash := tmp mod maxp
    if (hash < p)
        hash := tmp mod 2*maxp

Dhqueue (continued)

Insert(q,item)
    pos := faa(q.tail,1)
    - mark item with q.id and pos
    repeat
        mylist := hash(q,pos)
        list := (address of)q.lists[mylist]
        repeat
        until (list is initialized)
        P(list->lock,1)
        if (mylist = hash(q,pos))
            - put item on list
        V(list->lock,1)
    until (item is put on a list)
    faa(q.cnt,1)
    faa(htab.cnt,1)
    if (Depth(q.htab) > MaxLF)
        rr_Alock(htab.rrlock)
        expandtable(q.htab)
        rr_Aunlock(htab.rrlock)
    endif

Delete(q)
    if (not tdr(q.cnt,1))
        return(queue empty)
    pos := faa(q.head,1)
    repeat
        mylist := hash(q,pos)
        list := (address of)q.lists[mylist]
        repeat
        until (list is initialized)
        P(list->lock,1)
        - search list once for item with
          (i.qid = q.id and i.pos = pos)
        if (found)
            if (mylist = hash(q,pos))
                - take item from list
        endif
        V(list->lock,1)
    until (item is found)
    faa(htab.cnt,-1)
    if (Depth(htab) < MinLF)
        rr_Block(htab.rrlock)
        shrinktable(htab)
        rr_Bunlock(htab.rrlock)
    endif

Expand(htab)
    rw_rlock(htab.rwlock)
    faa(htab.size,1)
    maxp := htab.maxp
    p := faa(htab.p,1)
    if (p = maxp-1)
        read2write(htab.rwlock)
        htab.p := htab.p - maxp
        htab.maxp := 2*maxp
        write2read(htab.rwlock)
    else
        rw_runlock(htab.rwlock)
        - wait for htab.maxp to be reset
        p := p mod maxp
        maxp := htab.maxp
        rw_rlock(htab.rwlock)
    endif
    P(htab.lists[p],1)
    - rehash items in list p
    - put items rehashing to
      p + maxp list into external list
    V(htab.lists[p],1)
    initialize_list(htab.lists[p + maxp])
    P(htab.lists[p + maxp],1)
    - link in external list of items
    V(htab.lists[p + maxp],1)
    rw_runlock(htab.rwlock)

The memory requirements of this queue are O(m+n). I have used two quantities above, the
maximum and minimum load factors MaxLF and MinLF, which represent the deepest and
shallowest average bucket depth, respectively. These two constants determine how often
contraction and expansion will take place. The best case time complexities for insert and delete
are O(1) and O(1 + MinLF/2), respectively. The worst case is when a table size adjustment is
necessary. The resulting complexity is O(2 + 2*MaxLF) and O(2 + 2*MinLF/2) for insert and
delete, respectively. If a doubly, rather than a singly, linked list were used, these times could
be reduced to O(2 + MaxLF) and O(2 + MinLF/2), respectively. Each adjustment requires a scan of
two lists of length MaxLF or MinLF. A full scan is required on a merge in order to find the end of
the parent list. The extra 1/2 for a delete is the usual average scan length to find the desired
item. MinLF is usually chosen to be less than 0.61*MaxLF, and so the worst case will be for the
insert. In the tests, MinLF = 8 and MaxLF = 16.


Fqueue 

Fqueue is a fixed-size, group-lock-based queue in which user items are copied into queue
storage. Because items are copied into the queue, interior removal is not supported, and
pre-allocation of memory for the maximum number of items (N) that may be on the queue is
necessary.

Insertion and deletion are both two-phase operations. An insert does its work in the first
phase, while a delete does its work in the second phase; the first phase of a delete is empty.
This algorithm allows both deletes and inserts to enter in the same group, but within a group
serializes the inserts before the deletes.

Head  and  tail  counters  are  maintained  in  a  manner  similar  to  queue.  A  circular  array  is  used 
as  storage  to  hold  the  user  items,  with  a  size  counter  to  indicate  the  availability  of  space.  The 
pseudo-code  looks  as  follows. 


Fqueue

Initialize(q,size)
    q.cnt := 0
    q.tail := 0
    q.head := 0
    q.size := size
    for i := 0 to q.size-1 do
        groupinit(q.cell[i].glock)

Insert(q,item)
    grouplock(q.glock)
    if (not tir(q.cnt,1,q.size))
        return(queue full)
    mycell := faa(q.tail,1) mod q.size
    - put item into cell
    faa(q.cnt,1)
    groupsynch(q.glock)
    groupunlock(q.glock)

Delete(q)
    grouplock(q.glock)
    groupsynch(q.glock)
    if (not tdr(q.cnt,1))
        return(queue empty)
        groupunlock(q.glock)
    endif
    mycell := faa(q.head,1) mod q.size
    - take item from cell
    groupunlock(q.glock)

Vqueue 

Vqueue is a fixed-size queue based on the baker's algorithm in which, as in fqueue, values are
copied into queue storage. As with fqueue, interior removal is not supported, and pre-allocation
of memory for the maximum number of items that may be on the queue is necessary.

Inserts and deletes each acquire a ticket number from tail and head, respectively. The ticket
number is then used to order the tasks so that an insert follows a delete which follows an insert
which follows a delete and so on. A single ticket is associated with each cell in the queue that
indicates whose turn it is, either an insert's or a delete's.

Head and tail counters are maintained in the usual way. A circular array is used as storage
to hold the user items. The algorithm requires two counters associated with the number of items
on the queue. Filled and empty are the number of cells in the queue that contain values and those
that don't, respectively. They determine whether or not an insert or delete will initially fail. The
number of items on the queue at any one time is just the value of filled.

This two-counter scheme is necessary in order to avoid performing a tdr and a tir on the same
counter concurrently, which could (and did) lead to deadlock. If n+1 inserts begin and compete
for n cells in the queue, one insert will fail, as expected. Now if the failing insert is context
switched out before it can restore the count, then there will appear to be n+1 items on the
queue, allowing n+1 deletes to begin, one of which will endlessly wait for an item. The
pseudo-code for vqueue is shown below. Note that the ticket scheme assumes the queue is of at
least size two.


Vqueue

Initialize(q,size)
    q.empty := size
    q.filled := 0
    q.tail := 0
    q.head := 0
    q.size := size
    for i := 0 to q.size-1 do begin
        q.cell[i].ticket := i
        PVinit(q.cell[i].lock,1)
    endfor

Insert(q,item)
    if (not tdr(q.empty,1))
        return(queue full)
    mytail := faa(q.tail,1)
    mycell := mytail mod q.size
    cell := (address of)q.cells[mycell]
    repeat
    until (cell->ticket = mytail)
    P(cell->lock,1)
    - put item into cell
    V(cell->lock,1)
    faa(cell->ticket,1)
    faa(q.filled,1)

Delete(q)
    if (not tdr(q.filled,1))
        return(queue empty)
    myhead := faa(q.head,1)
    mycell := myhead mod q.size
    cell := (address of)q.cells[mycell]
    repeat
    until (cell->ticket = myhead+1)
    P(cell->lock,1)
    - get item from cell
    V(cell->lock,1)
    faa(cell->ticket,q.size-1)
    faa(q.empty,1)
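
The ticket arithmetic of vqueue can be exercised with a small Python sketch. It rests on several assumptions that are not part of the paper: fetch-and-add is emulated with a lock (the Counter class is hypothetical), tdr is written as a test, a decrement and an undo on failure, the per-cell P/V lock is left out because the ticket already admits one operation at a time per cell, and False/None are used to signal a full or empty queue.

```python
import threading
import time


class Counter:
    """Shared integer with fetch-and-add emulated by a lock (assumption)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def faa(self, delta):
        with self._lock:
            old = self.value
            self.value += delta
            return old

    def tdr(self, delta):
        """Test-decrement-retest: decrement only if the result stays >= 0."""
        if self.value - delta < 0:
            return False
        if self.faa(-delta) - delta < 0:
            self.faa(delta)              # undo the failed decrement
            return False
        return True


class VQueue:
    def __init__(self, size):
        assert size >= 2                 # the ticket scheme needs two cells
        self.size = size
        self.empty, self.filled = Counter(size), Counter(0)
        self.tail, self.head = Counter(0), Counter(0)
        self.items = [None] * size
        self.tickets = [Counter(i) for i in range(size)]

    def insert(self, item):
        if not self.empty.tdr(1):
            return False                 # queue full
        mytail = self.tail.faa(1)
        cell = mytail % self.size
        while self.tickets[cell].value != mytail:
            time.sleep(0)                # busy wait for our turn
        self.items[cell] = item
        self.tickets[cell].faa(1)        # hand the cell to the matching delete
        self.filled.faa(1)
        return True

    def delete(self):
        if not self.filled.tdr(1):
            return None                  # queue empty
        myhead = self.head.faa(1)
        cell = myhead % self.size
        while self.tickets[cell].value != myhead + 1:
            time.sleep(0)
        item = self.items[cell]
        self.tickets[cell].faa(self.size - 1)   # ticket of the next insert
        self.empty.faa(1)
        return item
```

After an insert and its matching delete, a cell's ticket has advanced by exactly q.size, which is the tail value the next insert into that cell will hold.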

Multi-Queues 


Mqueue 

Mqueue is an algorithm very similar to queue that stores an item with a multiplicity as a
single queue item. Items are stored on the familiar distributed array of linked lists. Inserts
are very similar to those of queue, except that when an insert notices it is inserting the first
item on the queue, it sets the global queue variable firstmult that indicates the multiplicity of
the first item. Deletes initially perform the usual tdr. After acquiring an item the delete
decrements the multiplicity of the first item, and if it goes to zero firstmult and head are
updated. Note that, whereas queue requires ordering the deletes, mqueue does not because of the
serialization enforced at the head of the list. Some extra synchronization is needed in order to
support interior removal. Since items can be removed, the next item on the queue may not be in
the next list. For this reason a search is made of the lists until one is found in which the first
item is present. I've included the extra code needed to support interior removal, namely the
reader-reader locks. Pseudo-code for mqueue is shown below.


Mqueue

Initialize(q,parallelism)
    q.cnt := 0
    q.tail := 0
    q.head := 0
    q.size := parallelism
    q.firstmult := 0
    rw_initlock(q.rwlock) *
    for i := 0 to q.size-1 do begin
        q.lists[i].cnt := 0
        q.lists[i].putticket := i
        q.lists[i].getticket := i
        PVinit(q.lists[i].lock,1)
    endfor

Insert(q,item,mult)
    rw_rlock(q.rwlock) *
    mytail := faa(q.tail,1)
    mylist := mytail mod q.size
    list := (address of)q.lists[mylist]
    repeat
    until (list->putticket = mytail)
    P(list->lock,1)
    - put single item with given
      multiplicity at end of list
    if (mytail = q.head)
        q.firstmult := mult
    V(list->lock,1)
    faa(list->putticket,q.size)
    faa(list->cnt,1)
    faa(q.cnt,1)
    rw_runlock(q.rwlock) *

Delete(q)
    rw_rlock(q.rwlock) *
    if (not tdr(q.cnt,1))
        rw_runlock(q.rwlock) *
        return(queue empty)
    endif
    repeat
        repeat
        until (q.firstmult > 0)
    until (faa(q.firstmult,-1) > 0)
    mylist := q.head mod q.size
    list := (address of)q.lists[mylist]
    repeat
    until (tdr(list->cnt,1))
    - use item at head of list
    if (faa(list->first->mult,-1) = 1)
        P(list->lock,1)
        - remove first item from list
        V(list->lock,1)
        repeat
            mylist := (mylist + 1) mod q.size *
            list := (address of)q.lists[mylist]
            P(list->lock,1) *
            q.head := mylist
            q.firstmult := list->first->mult
            V(list->lock,1) *
        until (q.firstmult > 0)
    endif
    rw_runlock(q.rwlock) *

* Necessary in order to support interior removal.


Tqueue 

Tqueue is a k-ary tree based multi-queue supporting interior removal, attributed to Isaac
Dimitrovski in reference [2]. This queue provides a fixed O(log N) access time and has a more
amenable memory requirement than does the other multi-queue, mqueue. When m queues (m
different trees) that can each hold a maximum of N items are associated with a common pool of
memory, the memory complexity has been shown to be O(m*(1 + log N + N)).

Each queue is a distinct k-ary tree in which items with unit multiplicity are stored at the
leaves of the tree. Items with multiplicity greater than one are stored in a combination of internal
nodes and leaves. An internal node can hold as many items as there are leaves in its sub-tree.

As in previously discussed algorithms, a head and a tail counter are used to get the position of
the item on the tree during either a delete or an insert. If I call the depth of the tree d then the
position can be represented as follows, with one base-k digit a_i for each of the d levels:

$$pos = \sum_{i} a_i \, k^{i}$$

To insert or delete an item, follow edge a_i at level i in the tree.
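
As a small illustration of this base-k representation, the following Python sketch extracts the digits of a position. The function name path_digits is hypothetical, and numbering the levels from 0 to d-1 with a_0 as the least significant digit is an assumption; the text does not say which end of the digit string corresponds to the root.

```python
def path_digits(pos, k, d):
    """Return [a_0, ..., a_(d-1)] with pos == sum(a_i * k**i)."""
    if not 0 <= pos < k ** d:
        raise ValueError("pos does not fit in a k-ary tree of depth d")
    digits = []
    for _ in range(d):
        pos, a = divmod(pos, k)   # peel off the next base-k digit
        digits.append(a)
    return digits


digits = path_digits(11, k=3, d=3)
rebuilt = sum(a * 3 ** i for i, a in enumerate(digits))
```

For a ternary tree of depth three, position 11 = 2 + 0*3 + 1*9 yields the digits 2, 0 and 1, one edge choice per level.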

Some  synchronization  and  queue  adjustment  are  required  in  order  to  function  properly.  A 
reader  lock  is  used  around  both  insert  and  delete  operations  with  the  delete  upgrading  to  a  write 
lock  when  it  is  necessary  to  adjust  the  queue.  When  traversing  a  path  to  a  position  in  the  tree, 
the  left  most  child  of  any  subtree  allocates  the  root  of  the  subtree.  All  the  other  children  of  that 
sub-tree  must  wait  (busy  wait)  for  it  to  be  allocated. 




When an internal node is allocated it is given a use count equal to twice the number of
leaves in its tree. Each time a process goes through the node the use count is decremented. It is
decremented on both inserts and deletes, hence the factor of 2 above. When a node's use count
is decremented to zero the node can be taken off the tree and returned to the pool of free nodes.
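
The use-count bookkeeping can be sketched in a few lines of Python. This is a single-process illustration: the Node class and its freed flag are hypothetical, and the plain read-then-subtract stands in for an atomic fetch-and-add on the use count.

```python
class Node:
    def __init__(self, leaves):
        self.usecnt = 2 * leaves   # one pass per insert plus one per delete
        self.freed = False


def checkfree(node, n):
    """Charge n passes to the node; free it when the count reaches zero."""
    old = node.usecnt              # stands in for faa(node.usecnt, -n)
    node.usecnt -= n
    if old == n:
        node.freed = True          # return the node to the free pool
    return node.freed


node = Node(leaves=4)
history = [checkfree(node, 1) for _ in range(8)]
```

A node above four leaves starts with a use count of eight, so only the eighth pass through it releases the node.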

The queue header contains two pointers that can each point to a tree that can support the
maximum number of items allowed on the queue, N. As a result, the queue can actually hold
2*N items if there are enough free nodes. However, the first tree will have to be emptied before
any items are re-inserted into that tree. At the end of a deletion, when the delete counter head
equals N, the read lock is upgraded to a write lock, tail and head are decremented, and the second
tree is moved to be the first tree of the queue. The pseudo-code for tqueue is shown below.


Tqueue

Initialize(q,size)
    q.cnt := 0
    q.tail := 0
    q.head := 0
    q.tailcheck := 0
    q.headcheck := 0
    q.size := size
    q.logsize := log_k(size)
    q.root[0] := null
    q.root[1] := null
    rw_initlock(q.rwlock)

Insert(q,item,mult)
    rw_rlock(q.rwlock)
    if (not tir(q.tailcheck,mult,2*q.size))
        return(queue full)
    pos := faa(q.tail,mult)
    if ((pos < q.size) and (pos + mult > q.size))
        additems(q.root[0],item,pos,
                 q.logsize,q.size-pos)
        additems(q.root[1],item,0,
                 q.logsize,mult-q.size+pos)
    else if (mult = q.size)
        q.root[pos div q.size] :=
            initnode(2*q.size,multinode)
    else
        child := pos div q.size
        if (pos mod q.size = 0)
            q.root[child] :=
                initnode(2*q.size,leafnode)
        else
            repeat
            until (q.root[child] <> null)
        endif
        additems(q.root[child],item,pos,
                 q.logsize,mult)
    endif
    rw_runlock(q.rwlock)

Delete(q)
    rw_rlock(q.rwlock)
    if (not tir(q.headcheck,1,q.tail))
        return(queue empty)
    pos := faa(q.head,1)
    node := q.root[pos div q.size]
    while (node is not a leaf- or multi-node)
        lastnode := node
        - based on pos go down a level
          in the tree, changing node
        checkfree(lastnode,1)
    endwhile
    repeat
    until (node.item <> null)
    item := node.item
    checkfree(node,1)
    rw_runlock(q.rwlock)
    if (last delete for 0th root)
        rw_wlock(q.rwlock)
        q.root[0] := q.root[1]
        q.root[1] := null
        q.tail := q.tail-q.size
        q.head := q.head-q.size
        q.tailcheck := q.tailcheck-q.size
        q.headcheck := q.headcheck-q.size
        rw_wunlock(q.rwlock)
    endif



Tqueue (continued)

Initnode(usecnt,type)
    node := new(nodetype)
    node.usecnt := usecnt
    node.type := type
    return((address of)node)

Additems(node,item,pos,level,mult)
    if (mult = 1)
        while (not at leaves)
            lastnode := node
            - go down a level based on pos
            if (left most child in subtree)
                node.child[dest] :=
                    initnode(k^level,leafnode)
            else
                repeat
                until (node.child[dest] <> null)
            endif
            checkfree(lastnode,1)
        endwhile
        node.item := item
    else
        - recursively distribute mult items
          among children
        checkfree(node,mult)
    endif

Checkfree(node,n)
    if (faa(node.usecnt,-n) = n)
        free(node)

k is the number of children of each node in the tree.


Pools 


Pool 

Pool is very similar to queue except that no effort is made to order the linked lists. Like
queue, pool supports interior removal and is of unlimited size. After performing the faa to get a
list number, an item is either inserted at or deleted from the head of the list. Each list, if not
for racing at the entrance to the list, would be a LIFO stack. Since the algorithm is FIFO over the
lists, though, pool can only be thought of as an unordered group of items. Pseudo-code follows.


Pool

Initialize(q,parallelism)
    q.cnt := 0
    q.tail := 0
    q.head := 0
    q.size := parallelism
    for i := 0 to q.size-1 do begin
        q.lists[i].cnt := 0
        PVinit(q.lists[i].lock,1)
    endfor

Insert(q,item)
    mylist := faa(q.tail,1) mod q.size
    list := (address of)q.lists[mylist]
    P(list->lock,1)
    - put item at head of list
    V(list->lock,1)
    faa(list->cnt,1)
    faa(q.cnt,1)

Delete(q)
    if (not tdr(q.cnt,1))
        return(queue empty)
    mylist := faa(q.head,1) mod q.size
    list := (address of)q.lists[mylist]
    repeat
    until (tdr(list->cnt,1))
    P(list->lock,1)
    - get item from head of list
    V(list->lock,1)
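
To see why pool returns items in neither FIFO nor LIFO order, the following single-process Python sketch mirrors the routines for pool. It is an illustration under stated assumptions: plain integers replace the atomic counters, the per-list locks and per-list counts are omitted because nothing runs concurrently, and the class name Pool and the use of None for an empty pool are choices made here.

```python
class Pool:
    def __init__(self, parallelism):
        self.size = parallelism
        self.cnt = self.tail = self.head = 0
        self.lists = [[] for _ in range(parallelism)]

    def insert(self, item):
        mylist = self.tail % self.size       # faa(q.tail,1) mod q.size
        self.tail += 1
        self.lists[mylist].insert(0, item)   # put item at head of list
        self.cnt += 1

    def delete(self):
        if self.cnt == 0:                    # tdr(q.cnt,1) failed
            return None
        self.cnt -= 1
        mylist = self.head % self.size       # faa(q.head,1) mod q.size
        self.head += 1
        return self.lists[mylist].pop(0)     # get item from head of list


pool = Pool(parallelism=2)
for item in range(6):
    pool.insert(item)
order = [pool.delete() for _ in range(6)]
```

With two lists, the items 0..5 come back as 4, 5, 2, 3, 0, 1: the lists are visited round-robin, but each one behaves as a stack.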

Set 

Set is a fetch-and-store (fas) based algorithm developed by Jan Edler in [3] to provide
operations on a fixed-size group of integers. Similar to faa, fas(a,i) is an atomic operation that
stores the value of i in a and returns the old value of a. An array of integers is initialized to the
user values, with head, tail, and qsize counters maintained to access the array. Because the user
can store any integers (e.g. pointers), this algorithm has all the power of pool.

Referring to the pseudo-code below, inserts perform a fas with the value to be inserted.
When zero is returned, the value has been successfully stored. Note that when the loop is
performed more than once, the fas is returning a value from another process. Similarly, deletes
perform fas until a nonzero value is returned.


Set

Initialize(s,size)
    s.cnt := 0
    s.tail := 0
    s.head := 0
    s.size := size
    for i := 0 to s.size-1 do
        s.cells[i] := i + 1

Insert(s,value)
    mycell := faa(s.tail,1) mod s.size
    repeat
        value := fas(s.cells[mycell],value)
    until (value = 0)
    faa(s.cnt,1)

Delete(s)
    if (not tdr(s.cnt,1))
        return(set empty)
    mycell := faa(s.head,1) mod s.size
    repeat
        value := fas(s.cells[mycell],0)
    until (value <> 0)
    return(value)
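
The swap loops of set can be tried out with the single-process Python sketch below. It makes assumptions beyond the text: the set starts empty rather than pre-filled, an explicit full check is added so that a lone process cannot spin forever swapping with an occupied cell, plain integers replace the atomic counters, and the names FasSet and _fas are hypothetical.

```python
class FasSet:
    def __init__(self, size):
        self.size = size
        self.cnt = self.tail = self.head = 0
        self.cells = [0] * size              # 0 marks an empty cell

    def _fas(self, index, new):
        """Fetch-and-store: write new, return the previous contents."""
        old, self.cells[index] = self.cells[index], new
        return old

    def insert(self, value):
        assert value != 0                    # 0 is reserved as the empty marker
        if self.cnt == self.size:
            raise OverflowError("set full")  # guard added for this sketch
        mycell = self.tail % self.size       # faa(s.tail,1) mod s.size
        self.tail += 1
        while value != 0:                    # repeat ... until (value = 0)
            value = self._fas(mycell, value)
        self.cnt += 1

    def delete(self):
        if self.cnt == 0:                    # tdr(s.cnt,1) failed
            return None
        self.cnt -= 1
        mycell = self.head % self.size       # faa(s.head,1) mod s.size
        self.head += 1
        value = 0
        while value == 0:                    # repeat ... until (value <> 0)
            value = self._fas(mycell, 0)
        return value
```

Run by one process, the loops each execute once; with several processes a swap may return another process's value, which the loop then stores again on its next pass.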

Cset 

Cset, also developed in [3], is a bit-mapped version of set using both fas and fetch-and-or,
faor. As with the other fetch-and-phi functions, faor(a,i) performs a bitwise logical OR of a and i
and returns the original value of a. The algorithm uses both a bit map and an array of integers
indicating words in the bit map that contain non-zero bits. Head and tail counters are maintained
over the latter array, and are used to index into it to obtain indices into the bit map. By
incrementing head and tail on each bit map access, different processes acquire different words in
the bit map and thereby avoid serialization. Serialization can't be avoided when there are more
processes than words in the set. After obtaining the correct word in the bit map, the faor is
performed on the desired word with the correct bit turned on or off for an insert or delete,
respectively. A delete chooses an arbitrary bit in the bit field to turn off and use as its return
value. When a bit is the first/last to be turned on/off the array of integers is updated using fas
to add/remove a value from the array as is done in set. The pseudo-code for cset follows.
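
Before the listing, the correspondence between a value and its bit can be sketched in Python. The sketch assumes a 32-bit word and uses value-1 for both the word index and the bit position, so that it is the exact inverse of the delete formula value = word*bitsperword + pos + 1; the function names are hypothetical, and picking the lowest set bit is just one possible "arbitrary" choice.

```python
BITS_PER_WORD = 32   # assumed word size


def locate(value):
    """Map a value in 1..size to (word index, bit mask)."""
    word, pos = divmod(value - 1, BITS_PER_WORD)
    return word, 1 << pos


def value_of(word, mask):
    """Inverse mapping, as used when a delete picks a set bit."""
    pos = mask.bit_length() - 1
    return word * BITS_PER_WORD + pos + 1


def take_any(bitmap, word):
    """Clear one set bit of bitmap[word] and return the value it stood for."""
    bits = bitmap[word]
    assert bits != 0               # caller must have found a non-empty word
    mask = bits & -bits            # lowest set bit
    bitmap[word] = bits & ~mask
    return value_of(word, mask)
```

A word whose last bit is cleared by take_any becomes zero, which is the point at which cset removes that word's index from its array of non-empty words.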


Cset

Initialize(s,size)
    s.cnt := 0
    s.tail := 0
    s.head := 0
    s.size := size
    s.words := size/bitsperword
    for i := 0 to s.words-1 do begin
        s.q[i] := i
        s.bitmap[i] := all bits on
    endfor

Insert(s,value)
    word := (value-1)/bitsperword
    bitmask := 1 shl (value mod bitsperword)
    if (faor(s.bitmap[word],bitmask) = 0)
        mycell := faa(s.tail,1) mod s.words
        repeat
            word := fas(s.q[mycell],word)
        until (word < 0)
    endif
    faa(s.cnt,1)

Delete(s)
    if (not tdr(s.cnt,1))
        return(set empty)
    mycell := faa(s.head,1) mod s.words
    repeat
        value := fas(s.q[mycell],-1)
    until (value >= 0)
    bits := fas(s.bitmap[word],0)
    pos := position of set bit in bits
    - turn off corresponding bit in bits
    value := word*bitsperword + pos + 1
    if (faor(s.bitmap[word],bits) = 0)
        mycell := faa(s.tail,1) mod s.words
        repeat
            word := fas(s.q[mycell],word)
        until (word < 0)
    endif
    return(value)


Cqtst

To test and develop the algorithms I wrote a general utility that accommodates the following:

1)  Multiple  queue  testing 

2)  Variable  queue  size  (per  test) 

3)  Multiple  copy  insertion 

4)  Interior  removal 

5)  Variable  number  of  spawned  processes 

6)  Queue 'exercising' (multi-process delete/re-insert)

7)  Timing  with  cqtst  overhead  removal 

Cqtst was written with more capabilities than were utilized in the tests done in this paper. It is
intended as a general tool to develop and test new and existing algorithms. All of the facilities
are useful to this end, so I will describe all of them. For a full list of optional arguments and
their functions see Appendix A.

In addition to the capabilities listed above, cqtst has the ability to perform parallel/serial
loading or emptying in any combination. That is, each task, loading and emptying of the queues,
can be done either in serial or in parallel. When parallel loading and emptying is requested,
loading and emptying will be taking place at the same time. This means, however, that one can't
parallel load and then parallel empty.

Cqtst is intended as a versatile tool with which to test algorithms in many different ways.
Data presented in this paper are drawn from a subset of the tests which cqtst can perform.

As of this writing there are twelve different queue/pool algorithms incorporated into cqtst.
They are:

1) llist       2) sllist      3) queue       4) fqueue
5) vqueue      6) hqueue      7) dhqueue     8) tqueue
9) mqueue     10) set        11) cset       12) pool

The items that are inserted on the queue will be either pointers or values depending upon
the queue being used. I've used a shared memory array of integers as the items to be inserted.
Retrieval is done into a shared memory sink using a fetch-and-add on the index. After the
queuing and dequeuing is finished, the sink is sorted and checked against the source array and
multiplicity information. The sink is not sorted when a serial load and serial empty was
performed and queue exercising and preremoval were not done. By not performing the sort for
serial-serial testing, an allowance is made to check for the serial FIFO capabilities of the queue.
No test is made for parallel FIFO performance. See the fifo test (fifo) program for parallel FIFO
testing.

Multiple  Queue  Testing  (-m) 

During a run of cqtst there can be more than one queue, all of the same type, which are
being operated on concurrently. No item will be inserted into more than one queue. This
capability is not allowed when using sets or csets, however. Cqtst is written so that queues are
loaded and emptied in order and each process operates on each queue. However, each process does
not wait for the queue it is currently working on to be filled/emptied before it goes on to the
next queue.

If I call numinserts the number of items to be inserted into the queues, taking into account
multiple copies, the maximum number of items on a single queue at one time will be
numinserts/m, where m is the number of queues. Space is allocated for the situation in which
each queue may contain numinserts items, but the total number of items on all queues will not
exceed numinserts (i.e. queues are only 1/m full). This is done to exploit any increase in
performance that the more memory efficient queues may have. See Table I for a listing of
memory requirements.

Variable  Queue  Size  (-n) 

Here,  the  number  of  distinct  elements  to  be  supported  by  the  queues  can  be  manipulated. 
This  capability  is  necessary,  since  some  algorithms'  insert/delete  times  will  vary  depending  upon 
the  maximum  number  of  items  allowed  on  the  queue. 

Multiple  Copy  Insertion  (-c) 

This allows for a distinct element to occur in the same queue more than once. The user
specifies a maximum multiplicity, c; the multiplicity of each element is then randomly chosen
(before queuing begins) from 1 to c, inclusive. Multiple insertion is meaningful only for the
multi-queues, tqueue and mqueue.

Interior  Removal  (-r) 

This  aspect  allows  for  the  premature  removal  of  an  element  from  a  queue.  Whether  an  item 
is  to  be  removed  or  not  is  a  random  decision  (made  before  queuing)  that  results  in  about  20%  of 
the  items  being  marked  for  removal.  Currently,  there  are  two  methods  in  which  removal  can  be 
done, 

1  -  During  the  emptying  phase,  instead  of  doing  a  FIFO  empty,  perform  a  remove.  Each 
process  is  designated  a  subset  of  the  items,  and  sequentially  runs  through  the  subset 
removing  the  item  if  it  was  marked  for  removal,  otherwise  a  delete  is  performed.  This 
does  not  guarantee  holes  since  another  process  may  have  already  deleted  the  item  to  be 
removed.  Some  items  will  be  removed  though,  and  so  this  option  allows  for  the  parallel 
testing  of  item  removal. 

2  -  Perform all removals before doing any emptying. This requires that loading be done
serially, which is done whenever this method is chosen. With this mechanism, one is able
to examine the relative slowdown of an algorithm's retrieval time when removals are
done.

A  fetch-and-add  on  a  counter  is  performed  after  a  successful  removal,  and  the  total  reported  in 
the  output  page. 

Removal  should  not  be  used  when  items  are  given  a  multiplicity  greater  than  one  and  a 
multi-queue,  such  as  mqueue,  is  being  tested  for  which  a  single  removal  involves  deleting  every 
occurrence  of  the  item  on  the  queue.  Tqueue  does  not  exhibit  this  behavior  and  may  be  tested  in 
the  manner  just  described. 

Variable  number  of  spawned  processes  (-s) 

The  user  may  specify  the  number  of  processes,  S,  to  be  involved  in  the  testing.  These  S 
child  processes  are  divided  up  evenly  among  the  loading  and  emptying  tasks.  When  loading  and 
emptying  are  done  in  parallel  there  will  be  S/2  processes/task,  otherwise  all  S  processes  will  be 
devoted  to  the  single  parallel  task,  loading  or  emptying. 

S  may  also  be  the  number  of  processes  involved  when  performing  queue  "exercising".  Since 
serial  loading  and  emptying  is  forced  for  exercising,  there  is  no  conflict  with  the  alternate 
definition. 

Queue  "exercising"  (-p) 

This capability is provided to simulate an environment in which the queues/pools are being
heavily used by as many processes as the user desires. Queues/pools are first loaded up in serial.
Then a user-chosen number of processes are spawned which then delete from and re-insert onto
the queue/pool a user-specified number of times. After all the processes have finished, the
queues/pools are emptied serially. When exercising in parallel each item may not be deleted and
inserted the same number of times, nor will each item remain in the same position with respect to
other items. This is non-deterministic, with the caveat that when done exercising the queues will
still hold the same items. This will, however, destroy the original ordering of the items, and so
the sink must be sorted in order to check for queuing errors.

Queue exercising also provides for the testing of queue readjustment. Most queues will use
head and tail counters to keep track of where items should be inserted and deleted from,
respectively. As the queue experiences more inserts and deletes the head and tail would grow
without bound unless some kind of readjustment on these values and possibly the queue itself were
performed. Since exercising can be done in parallel or serial, adjustments can be tested in either
environment.

Exercising does not work when a multiplicity greater than one is used on the two
multi-queues. The mqueue algorithm assumes the user puts one item with a given multiplicity on
the queue once and then the item's multiplicity is decremented on each delete. It further assumes
that the item just retrieved will not be placed back on the queue. Since this is exactly what queue
exercising does, it will not work for mqueue when multiplicities are greater than one. With tqueue,
memory is allocated for n distinct items (not n*c, where c is the average multiplicity of items in
the queue) that may or may not have a multiplicity. Each time an occurrence of an item with a
multiplicity greater than one is deleted and re-inserted, a new distinct item would be created.
This would lead to the need for more memory than was originally allocated and cause a failure.

Timing  (-t,  -b,  -p) 

At the end of each run a single page of status information is produced that contains, among
other things, information about the amount of time (wall clock time) required to perform
initialization, loading, exercising and emptying. There are three times available (user, system
and total) for each of the four tasks (initialization, loading, exercising, emptying); all are in
milliseconds. For parallel/parallel testing, an average of the loading and emptying times is
reported.

While  the  option  to  not  subtract  off  the  overhead  of  cqtst  (see  the  -b  option)  is  available,  a 
normal  run  of  cqtst  will  provide  this  subtraction.  This  is  accomplished  by  running  through  the 
same  operations  as  in  the  specified  test,  but  not  actually  calling  any  queue  routines.  The  amount 
of  time  to  do  this  is  accumulated  and  is  subtracted  off  the  time  obtained  when  the  queues  are 
tested. 

Inherent in the timing statistics is a variance of which the user should be aware. To reduce
this variance, which is present in all timing values, the -t option is provided. It allows the
user to perform the same operations many times (queue initialization is done only once) and
thereby reduce the variance. Given that, based upon the timing techniques, the standard
deviations will be no larger than 23 ms/√cycles, the user can pick the number of cycles
needed to obtain the desired accuracy.
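
As a rough worked example of this rule, the following sketch assumes that the bound on the standard deviation shrinks as 23 ms divided by the square root of the number of cycles. The function name and its parameters are illustrative and are not part of cqtst.

```python
import math

def cycles_needed(target_sd_ms, bound_ms=23.0):
    """Smallest cycle count n such that bound_ms / sqrt(n) <= target_sd_ms.

    Assumes the standard deviation bound scales as bound_ms / sqrt(n).
    """
    if target_sd_ms <= 0:
        raise ValueError("target standard deviation must be positive")
    return max(1, math.ceil((bound_ms / target_sd_ms) ** 2))

# A target of 11.5 ms needs (23 / 11.5)^2 = 4 cycles under this assumption.
print(cycles_needed(11.5))
```

Under this assumption, halving the desired standard deviation quadruples the number of cycles that must be requested with -t.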

Output 

The first printout below represents a default run of cqtst. The following command produced
the second printout.

%cqtst -p1000 -t4 -s4 -q3 -h213 -r2 -ec

Each printout gives enough information to completely specify the input options (e.g. queue type,
number of items on queue, number of cycles, number of children ...). Notice that serial loading
was forced for the second printout. This resulted because removal method two was chosen. Once
again, all times are in milliseconds.
When performing queue exercising with the -p option, a value for the amount of time
required to do a single delete/re-insert operation is reported in milliseconds per operation,
normalized to a single processor. If there are p processes and it takes time t (wall clock time)
for each process to complete N delete/re-insert operations, then I define the wall clock rate as
R_w = t/N. Normalization to a single process is done by multiplying the wall clock rate by the
number of processes.
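
The normalization can be written out as a short sketch. The function names and the sample values (500 ms, 1000 operations, 4 processes) are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def wall_clock_rate(t_ms, n_ops):
    """Wall clock time per delete/re-insert operation: t / N."""
    return t_ms / n_ops

def normalized_rate(t_ms, n_ops, p):
    """Rate normalized to a single process: p * (t / N)."""
    return p * wall_clock_rate(t_ms, n_ops)

# Hypothetical run: 4 processes each finish 1000 operations in 500 ms.
print(normalized_rate(500, 1000, 4))  # ms per operation, per process
```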

In addition to the timing statistics, information about the number of swaps that took place
during testing is made available. This is provided primarily to allow the user to verify that
results were not invalidated by interference from the system swapper.


Queue type used : fqueues
Memory : O(n) = 1024
1024 elements were distributed among 1 queue(s)
Multiplicity : 1      Total insertions : 1024/cycle

Number of insert/delete operations per process per queue
(excluding load/empty) during exercising : 0
Number of load/exercise/empty cycles : 1

Loading  : parallel with 4 children
Emptying : parallel with 4 children

The following stats have been adjusted for overhead

Wall Clock Time       init   load   exercise   empty   total
user   (ms/cycle)        0    230          0     230     460
system (ms/cycle)        0     19          0      19      39
total  (ms/cycle)        0    250          0     250     500
Voluntary Swaps          0      0          0       0       0
Involuntary Swaps        0      0          0       0       0
No queuing errors detected

No  queuing  errors  detected 

No  queuing  errors  detected 

No  queuing  errors  detected 

Queue type used : Hash queues, hsize = 213
Memory : worst case n = 1024
1024 elements were distributed among 1 queue(s)
Multiplicity : 1      Total insertions : 1024/cycle

Removal method 2 used, 182 elements successfully removed/cycle

Number of insert/delete operations per process per queue
(excluding load/empty) during exercising : 1000
Number of load/exercise/empty cycles : 4

Loading    : serial
Exercising : parallel with 4 children
Emptying   : serial

The following stats have been adjusted for overhead

User   time/insert&delete operation : 0.613542 msec
System time/insert&delete operation : 0.019792 msec

Wall Clock Time       init   load   exercise   empty   total
user   (ms/cycle)        0    187        613     400    1201
system (ms/cycle)      200      0         19      16     236
total  (ms/cycle)      200    187        633     416    1437

Voluntary Swaps        0  0  0  0  0

Involuntary  Swaps      0  0  0  0  0 




Fifo 

Fifo is a relatively short program written to test whether algorithms conform to the
definition of parallel FIFO given on page one. Fifo first fills the queue halfway and then
spawns an equal number of inserters and deleters. Each inserter inserts the value of a private
integer that is incremented on each insert, after masking the high four bits of each item with
the inserter's spawn index. Deleters then delete from the queue and thereby obtain items from
all the different insert processes. Each deleter keeps a local record of the maximum valued item
thus far retrieved for each inserter. If at any time a deleter retrieves an item for an inserter
that is lower than the current maximum for that inserter, a fifo error has occurred.
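
The check performed by each deleter can be sketched as follows. This is an illustration only and not the source of fifo: it assumes 32-bit items whose low 28 bits hold the inserter's private counter, and the names tag, untag and Deleter are hypothetical.

```python
SEQ_BITS = 28                      # assumed: 32-bit items, high four bits hold the spawn index
SEQ_MASK = (1 << SEQ_BITS) - 1

def tag(inserter, seq):
    """Pack an inserter's spawn index (0..15) into the high four bits of an item."""
    return (inserter << SEQ_BITS) | (seq & SEQ_MASK)

def untag(item):
    """Split an item back into (spawn index, private counter value)."""
    return item >> SEQ_BITS, item & SEQ_MASK

class Deleter:
    """Tracks, per inserter, the largest counter value retrieved so far."""

    def __init__(self):
        self.max_seen = {}         # spawn index -> largest counter seen
        self.errors = 0

    def observe(self, item):
        inserter, seq = untag(item)
        if seq < self.max_seen.get(inserter, -1):
            self.errors += 1       # item is older than one already retrieved: fifo error
        else:
            self.max_seen[inserter] = seq

d = Deleter()
for item in (tag(0, 0), tag(1, 0), tag(0, 1), tag(0, 0)):
    d.observe(item)
print(d.errors)                    # the final item arrives out of order
```

Interleaving between different inserters is never flagged; only the relative order of items from the same inserter is checked, which matches the per-inserter maximum described above.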

One of the main differences between fifo and cqtst is in the duties of each spawned process.
When exercising, each process of cqtst performs a delete followed immediately by an insert. This
imposes a balancing of inserts with deletes. Fifo, on the other hand, spawns processes which are
dedicated to a single activity, inserting or deleting. This means that every process is completely
autonomous, with the exception of the test algorithms' synchronization. Fifo performs a very
thorough test of the queues, except that no attempt is made to see whether items were lost from
the queues.

I will not present any hard data for fifo except to say that all the queues have successfully
passed this test. Fifo was a very helpful, if not indispensable, tool in determining whether a
queue really is a queue as defined earlier.
Testing and Results


The following data was gathered using the "exercising" facility of cqtst. The times presented
and the data extracted from them are all based on the time it takes one process to do a delete
followed by a re-insert of the deleted item.

Because the Ultracomputer prototype is a single shared bus machine, some care must be
taken when determining the relative performance of an algorithm in a parallel or even a shared
environment. All data was produced in single user mode, so that bus contention results only from
deliberately concurrent processes. To eliminate the effects of bus contention due to s spawned
processes during a single parallel run of cqtst, s separate but similar serial and concurrent runs
of cqtst are made later. The latter test will give me the serial speed of the algorithms, but with
bus contention similar to what was present during the single parallel run of cqtst. I give the
following symbols for the three different values available.

R_s  = serial delete/re-insert rate, with no bus contention (1 process)

R_sc = serial delete/re-insert rate, with bus contention (p processes)

R_p  = parallel delete/re-insert rate, with inherent bus contention (p processes)

Now, from these three values I can extract two important quantities as follows.

Efficiency = R_sc / R_p ,    Contention = R_sc / R_s

Efficiency is a measure of the amount of serialization that results from synchronization
among sibling parallel processes. A perfect algorithm would not suffer any such serialization and
would score an efficiency of one. The lower the efficiency, the worse the algorithm performs in a
parallel environment.

Contention is a measure of an algorithm's shared memory use. A perfect value, no bus
contention, would result in a value of one. The worse the performance, the larger the value of
contention. Contention is a ratio of local to shared memory access and so provides no information
concerning the intensity of shared accesses.
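
The two ratios can be sketched directly from their definitions: efficiency is the serial rate under bus contention divided by the parallel rate, and contention is the serial rate under bus contention divided by the uncontended serial rate. In the sketch below the uncontended serial rate of 0.75 ms is hqueue's entry in Table 2, while the other two rates are hypothetical values used only to illustrate the arithmetic.

```python
def efficiency(r_sc, r_p):
    """Serial rate with bus contention over parallel rate; 1.0 means no serialization."""
    return r_sc / r_p

def contention(r_sc, r_s):
    """Serial rate with bus contention over uncontended serial rate; 1.0 means no bus contention."""
    return r_sc / r_s

r_s = 0.75    # ms/op, hqueue serial rate from Table 2
r_sc = 0.90   # ms/op, hypothetical serial rate with bus contention
r_p = 1.00    # ms/op, hypothetical normalized parallel rate

print(round(efficiency(r_sc, r_p), 2), round(contention(r_sc, r_s), 2))
```

Because all three rates are times per operation, an efficiency below one and a contention above one both indicate a loss relative to the ideal.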

All tests were performed on a queue/pool that could support up to 1024 elements. In the
case of queue, mqueue and pool the number of lists is always equal to the number of processes
accessing the data structure. When hqueue was tested the size of the hash table was chosen to be
101, which means that on average each delete had to search a list of size ten. For the first set of
data, the value of k in tqueue was six.

The objective of the tests providing the first set of data is to show the performance of the
queues and pools when the load on the system changes. It is interesting to know what happens
as the ratio of the number of PEs available for scheduling to the number of processes using
a queue or pool decreases. For example, initially there may be only four processes competing for
eight PEs, in which case there would hopefully not be a problem. But as the system becomes
loaded with other, possibly unrelated, tasks, the number of effectively available PEs is reduced,
which changes the above ratio. I will refer to this as PE contention. Although the processes in a
multi-user system will not in general be of the same type, I believe I effectively simulated this
effect by adding more of the same type of processes (i.e. more children of cqtst) to the system.

Factors which may contribute to performance characteristics under these circumstances are
the scheduling method and the fact that busy-waiting rather than non-busy-waiting semaphores
were used. The latter is probably the more influential of the two.

Table 2 gives the serial speeds (R_s) of the different algorithms. Listed are the times it takes
to load and empty the queues and the time it takes to perform a single delete/re-insert operation.
Because sllist and llist differ only in a P, V reference, we can see that the binary semaphores may
add as much as .04 msec per reference pair. In the case of llist this amounts to almost halving the
speed.

Graphs 1-5 and 6-9 show the values of R_sc, R_p, efficiency and contention versus the number
of sibling processes for the queues and pools, respectively. When viewing Graphs 2, 3, 6, and 7,
remember that the bus contention has not been removed. Graph 3, an expanded view of Graph 1,
was added to better display vqueue, fqueue and tqueue data. Data for queue and mqueue was not
taken for the interval in which the number of processes exceeded the number of PEs available,
because their performance there was so poor. The phenomenon, to be discussed later, is also
present in vqueue, but to a lesser degree. The data presented for dhqueue does not reflect time for
expansions or contractions during exercising, since not enough items are ever removed to
precipitate them. Until the number of processes exceeds eight, the variances are around 3%, but as
the number of processes grows the variances grow steadily to about 25%.

Secondly, I took data indicating tqueue's performance for different values of k, the number
of children of each node in the tree. Graph 10 shows the data for four different values of k: k =
2, 4, 6 and 7. The data shown here is statistically very good, with a maximum variance of less
than 2%.


              Serial Performance, N = 1024

                        Time (ms)
Queue        Load     Empty     Delete/Re-insert (R_s)

sllist         61        68         .15
llist         113       105         .23

queue         416       433         .85
hqueue†       251       508         .75
dhqueue       423       736        1.11
fqueue       1591      1560        2.34
vqueue        363       376         .73

tqueue       1066      1043        2.03
mqueue        426       458         .89

set           240       261         .50
cset          273       778        1.04
pool          221       266         .48

† Average bucket depth ≈ 10


Table 2.
Discussion

Referring to Table 2, the serial linked list, sllist, provides a reference point from which to
judge the other algorithms' raw speed. The reference allows us to neglect the absolute speed of the
Ultracomputer and focus on the relative speeds of each algorithm. Neglecting the parallel linked
list, llist, the closest in speed is that of vqueue or hqueue, which are each about five times as slow.
Since these two algorithms operated with a near perfect efficiency up until they encountered PE
contention, they would need five dedicated processors in order to equal the performance of the
serial linked list on one processor.

The number of dedicated PEs required for each algorithm to outperform sllist is shown in
Table 3, below.


Number of Processors to Equal Performance of Serial Linked Lists

Queue        Required Number of Dedicated PEs
             Serial Speed/((sllist Speed) x (Efficiency @ 8 Processes))

sllist          1.0
llist           5.1†

queue           5.7
hqueue          5.0
dhqueue         9.3†
fqueue         17.8†
vqueue          4.9

tqueue         14.1†
mqueue         15.0†

set             3.3
cset            7.0
pool            3.1

† Efficiency not constant over 1 to 8 processes


Table 3.


Using the efficiency at eight processes is not strictly valid unless the efficiency is constant over
the range from one to eight. I have flagged the cases where it was not constant over the range, but
still reported the numbers since they may be of interest. In general, it would be more accurate to
determine the efficiency as a function of the number of dedicated PEs. From Table 3, it is clear
that an eight processor machine can outperform a similarly equipped serial machine.
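
The Table 3 formula can be checked against two of its entries with a short sketch. The serial speeds are taken from Table 2. The efficiency of 1.0 used for hqueue and vqueue is an assumption based on the statement that both ran at near perfect efficiency up to eight processes, and the function name is illustrative.

```python
def required_pes(serial_speed, sllist_speed, efficiency_at_8):
    """Serial Speed / ((sllist Speed) x (Efficiency @ 8 Processes))."""
    return serial_speed / (sllist_speed * efficiency_at_8)

SLLIST = 0.15                       # ms per delete/re-insert, from Table 2

# Efficiency of 1.0 is assumed here ("near perfect" in the text).
print(round(required_pes(0.75, SLLIST, 1.0), 1))   # hqueue
print(round(required_pes(0.73, SLLIST, 1.0), 1))   # vqueue
```

An efficiency below one raises the required number of PEs in inverse proportion, which is why the flagged entries depend on where in the range the efficiency is sampled.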

Also important may be the fact that the Ultracomputer does not have fast divide (and
modulo) or multiply hardware. This could unfairly hurt an algorithm's relative raw speed. If
these slower operations take place in critical sections, then the efficiency would be expected to be
less than optimal. Tqueue would be expected to be affected the most, since a divide and a modulo
operation are performed at each level of the tree. The hash queue algorithms may also be affected
due to an extra multiply and modulo in the hash function, although dhqueue's modulo is really
just a logical AND.

In  the  next  analyses,  I  will  for  the  most  part  be  interested  in  efficiency;  if  an  algorithm  has  a 
good  efficiency  it  can  be  made  to  be  useful  by  putting  it  on  a  machine  with  enough  PEs,  regard- 
less of  raw  speed.  Also,  the  plotted  speeds  contain  bus  contention,  whereas  this  has  been  normal- 
ized out  of  the  efficiency  data. 
Singleton Queues

The parallel linked list, llist, provides another useful reference against which the other
algorithms can be compared. Almost any algorithm that performs better than llist can be considered
to be at least somewhat parallel. Initially, with fewer than eight processes, all the queues perform
better than llist, but as the processes begin to compete for PEs only hqueue and dhqueue outperform
it in terms of efficiency.

Fqueue performs quite well until there are more processes than PEs. The drop is most
likely due to context switching of critical processes. The effect is magnified through the use of
gsynch, because it forces all processes to re-form, which can be very expensive when there is PE
contention. To alleviate this problem it might be useful to have a premature unlock with which the
tasks doing the inserts (the first phase) can leave the group as soon as they have finished. The
size of the groups may also play an important role here. Also, Eric Freudenthal, in NYU's
Ultracomputer group, is currently working on hybrid lock algorithms that may be more efficient than
the ones used here.

Dhqueue  performed  remarkably  well  considering  the  potential  for  extra  overhead.  It  does 
not  show  even  efficiency  out  to  eight  processes,  but  it  does  possess  a  graceful  decline  in  perfor- 
mance out  to  fourteen  processes.  This  is  something  none  of  the  parallel  queues  exhibit  and  is 
fairly  important  since  operating  systems  that  suddenly  become  very  slow  are  more  frustrating  than 
ones  that  do  so  gradually.  Somehow  the  user  is  more  compassionate  in  the  latter  case.  At 
present  I  don't  understand  this  behavior,  but  I  hope  to  be  able  to  examine  it  in  future  work. 

Overall, hqueue seems to be the best of all the queues. It has a perfect efficiency out to eight
processes and drops the least beyond that. It is also one of the fastest of the truly parallel
queues. It would be the clear choice in any application in which the queue size does not fluctuate
too much. When the size does fluctuate, dhqueue may prove to be a better choice of queue, its
poorer efficiency notwithstanding.


PE  Contention 

At this point you might ask why there is no data for queue and mqueue above eight
processes. As mentioned before, queue and mqueue behave in a similar manner to that of vqueue,
but with more severe results. Performance became so bad that it would have taken days to get
the rest of the data. Each of these three queues has at least one ticket associated with either a
list (queue, mqueue) or a cell (vqueue). The behavior is attributable to three things, in order: the
ticket scheme, the busy-waiting synchronizations and the scheduling. I am not implying that the
wrong ticket algorithm was used; rather, any ticket algorithm would have produced the same
result.

After a few minutes of running more processes than PEs, these three factors caused all the
processes to be waiting on a single cell or list for their ticket to be called. This becomes fairly
slow when the process whose ticket is called is currently context switched out. It becomes
extremely slow because of the following scenario. When a process that is at the head of the line
for the bottlenecked cell/list gets scheduled, it inserts and deletes all the way around the queue
until it comes back to the original cell/list, where it is then at the end of the line at the same
bottleneck. This continues ad infinitum with all the processes waiting on the same cell/list.
Conceivably, one could get only circular_array_size operations per scheduler interrupt. In fact,
this is probably what happened, since vqueue had a much larger array (1024) than did either queue
or mqueue (number of processes) and performed the best of the three.

This behavior is partially attributable to busy-waiting, since it forces the scheduler to choose
which process is context switched out. A method that chooses whether to busy-wait or
non-busy-wait based on the process's position in line seems more reasonable. See reference [11] on
busy-waiting versus blocking semaphores. The scheduler is at fault too, though. After the first
process in line is scheduled, why isn't the second process in line ready to go? If this were to
happen, it seems that at the very least the list that is being waited on should move around the
queue. This was observed not to be the case. Presumably all processes had the same priority, and
selection by the scheduler is therefore essentially random. If the scheduler knew who was waiting,
it could do much better.

Most important, though, is the fact that a ticket is used to order the processes well before
the critical section is encountered. This allows all the context switching that caused the problem
in the first place. The algorithm of pool is almost identical to that of queue, but does not enforce
an ordering (i.e. no tickets) and does not exhibit this behavior. The two hash queues avoid the
problem by tagging the items with the position in the queue, thereby avoiding the need to
order the operations. These facts indicate that preassigned ticket based algorithms, without
non-busy-waiting semaphores, will not be optimal.
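
For readers unfamiliar with the ticket scheme, a generic busy-waiting ticket lock can be sketched as follows. This is not the code used in queue, mqueue or vqueue. Fetch-and-add is emulated here with an ordinary Python lock, and every name in the sketch is illustrative.

```python
import threading
import time

class TicketLock:
    """Generic busy-waiting ticket lock (illustrative only)."""

    def __init__(self):
        self._faa = threading.Lock()    # stands in for a hardware fetch-and-add
        self._next_ticket = 0
        self._now_serving = 0

    def _fetch_and_add(self):
        with self._faa:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def acquire(self):
        ticket = self._fetch_and_add()        # order is fixed here, before the critical section
        while self._now_serving != ticket:    # busy-wait until this ticket is called
            time.sleep(0)                     # yield; if the next ticket holder is not running,
                                              # everyone behind it keeps spinning

    def release(self):
        self._now_serving += 1                # only the current holder writes this field

def run(workers=4, rounds=50):
    """Increment a shared counter under the lock from several threads."""
    lock, total = TicketLock(), [0]

    def work():
        for _ in range(rounds):
            lock.acquire()
            total[0] += 1
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

print(run())
```

The point of the sketch is the position of the fetch-and-add: the service order is committed when the ticket is drawn, so a process that is switched out while holding the next ticket delays every process queued behind it.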


Multi-Queues 

When viewing this data, remember that the test performed here is not a fair one for mqueue,
in that items with unit multiplicity were put on the queues. Mqueue has been designed strictly as
a multi-queue and makes no attempt to provide good performance as a singleton queue. Tqueue
does not suffer, since it is able to function well as a singleton queue as well. Because the number
of processes is always larger than the multiplicity of any item on an mqueue, as a process tries to
obtain the next item on the queue it locks out any others who also try to delete an item. This
causes a serialization of the processes that results in reduced performance. Even so, mqueue
outperforms tqueue in raw parallel speed until about the seventh or eighth process is added,
although tqueue maintains much better efficiency ratings throughout.

Tqueue also suffers some serialization, but because each process is forced to obtain an item
from a unique leaf of the tree, with good process scheduling serialization is minimal. Serialization
occurs at interior nodes on insertion when the tree is being built. Assume two processes, A and
B, enter the queue at the root, with B entering immediately after A. A and B will travel the
same paths down the tree until the lowest level, when they will split off to their own leaves. Since
A entered first, it will be creating the path to the bottom of the tree as it goes. This means that B
will be forced to wait behind A until the very last internal node. With many processes there will
be many such threads of serialization, all of different lengths (0..log_k N), each with a different
number of processes. In the best case two processes will be serialized for the full length. The
worst case occurs when the leftmost process in any subtree blocks the other child processes near
the top of the tree. This would most likely be due to PE contention, but can also result when the
critical process has been delayed waiting for the bus or any other resource.

All of this, however, is not meant to imply that n insertions by n processes cannot be done
in O(log_k N) time. If there are a random number of waits at each level, independent of the
number of inserters, the number of waits is only a function of the depth of the tree. The total
time for all processes entering at the same time and directed towards the smallest sub-tree will be
the sum of the maximum wait time at each internal node.

The  tqueue  data  exhibits  a  decrease  in  performance  when  the  number  of  processes  equals 
six.  This  is  statistically  significant  and  in  the  second  set  of  data  we  will  see  that  this  is  not  a  coin- 
cidence. 

Overall, tqueue may be the better general purpose multi-queue since it also performs well as
a singleton queue. It is quite a bit slower than mqueue, but does have a guaranteed upper bound
on time complexity in the absence of PE contention. Mqueue would be useful in a situation
where multiplicities are equivalent to the number of processes using the queue, or where there is
not expected to be much contention for the queue among processes.
Pools

The comparisons among the pools show set and pool to be dead even in raw speed. Cset
suffers reduced speed because of the overhead of the separate internal queue used to keep track of
usable words in the bit map. Set and cset suffer an extra decrease in raw speed since they utilize
the faa-emulated fas and faor (cset only). The graphs of efficiency show that both sets have a more
graceful decline in performance than does pool. As discussed earlier, this may be important.

When choosing a pool, we should consider their applications. Pool can be used to maintain
lists of available resources. The two sets' primary uses are for maintaining lists of available flags
or identifiers (integers). In either application it is likely there will be one pool associated with
many tasks. For pool, the size of the pool will rise linearly in the number of processors, while for
the two set algorithms space does not grow with the number of processes, only with the number of
queues. Because it is not likely that the number of queues will grow with the number of
processors, the two sets will require less space in a highly parallel system.


Tqueue  Performance  vs.  K-ariness  of  Tree 

The data presented in Graph 10 clearly demonstrates a performance dependent on the value
of k. The decrease does not simply occur when k divides the number of processes (p) or vice
versa. If this were the case, I would expect to see a drop for k = 2 when p = 6, since there was
one for p = 2, 4 and 8. Rather, the decreases occur in two cases:


1) p divides k

2) p is a power of k greater than one.


This phenomenon results from the busy-waiting that goes on at internal nodes when the
process assigned to allocate the next node in the tree has been delayed. This results in a forced
synchronization or packetizing of the processes at the internal nodes. Note that this is not due
just to PE contention, as it occurs even when adequate PEs are available. The packetizing has
the worst consequences in the two cases above. In case one, there is one group of size k/p that is
on average moving through the queue "together". Because the size of the group evenly divides the
number of leaves in a sub-tree, the group will stay packetized as it moves through the queue. In
case two, the number of processes is equal to the number of leaves in some subtree of the queue.
In this situation there is one group of processes that are on average grouped at the lowest root of
each subtree of depth log_k(p). The effect becomes worse as the depth of the given subtree
increases.

An important result of this is that the value of k should be chosen as a prime number near
the expected number of processes that will be using the queue. This would effectively eliminate
the effects just discussed.
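
The two cases and the recommendation can be turned into a small selection sketch. All function names are hypothetical. Two readings are assumed: "a power of k" is taken to mean p = k^m for some integer m of at least one, and "near" is taken to mean the closest prime that triggers neither case for the expected p.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small k."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_power_of(p, k):
    """True if p == k**m for some integer m >= 1 (assumed reading of case 2)."""
    if k < 2 or p < k:
        return False
    while p % k == 0:
        p //= k
    return p == 1

def packetizing_risk(k, p):
    """Case 1: p divides k.  Case 2: p is a power of k."""
    return k % p == 0 or is_power_of(p, k)

def choose_k(p):
    """Closest prime to p (lower candidate tried first) that avoids both cases."""
    if p < 2:
        raise ValueError("expected at least two processes")
    for distance in range(p + 1):
        for k in (p - distance, p + distance):
            if k >= 2 and is_prime(k) and not packetizing_risk(k, p):
                return k
    raise AssertionError("no suitable k found")

print(choose_k(8))
```

With k = 2 the sketch flags p = 2, 4 and 8 but not p = 6, in line with the drops reported for Graph 10.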


Conclusion 

Many of these algorithms rely on the graceful overflow of integers from the largest
representable integer to zero, so that no additional synchronization is needed to adjust the
counters. For this reason it would seem that any parallel machine should utilize a processor that,
at least optionally, handles overflow without trapping or otherwise causing problems.
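
The role of graceful overflow can be illustrated with a sketch of wrapping head and tail counters. It assumes 32-bit unsigned counters and a circular array whose size is a power of two. None of the names come from the tested implementations.

```python
MASK32 = 0xFFFFFFFF                 # assumed 32-bit unsigned counters

def fetch_and_add(counter, inc=1):
    """Return (old value, new value), wrapping from 2**32 - 1 to 0 without trapping."""
    return counter, (counter + inc) & MASK32

def slot(counter, size):
    """Array index for a counter; continuous across the wrap only if size divides 2**32."""
    return counter % size

def occupancy(head, tail):
    """Number of items between head and tail, correct even after tail has wrapped."""
    return (tail - head) & MASK32

old, new = fetch_and_add(MASK32)
print(old == MASK32, new)                      # the counter wraps to zero
print(slot(MASK32, 1024), slot(new, 1024))     # consecutive slots across the wrap
print(occupancy(0xFFFFFFFE, 2))                # tail has wrapped, head has not
```

Because the slot index and the occupancy are both computed modulo a power of two, the counters never need to be readjusted under a lock.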

When using distributed linked lists, as in queue, it is better to tag each item in the list with a
number indicating its position in the queue, rather than trying to keep the list itself ordered. To
keep the list ordered one must order the inserts and possibly the deletes, but because of race
conditions this is somewhat cumbersome. Not ordering the lists and putting items at the head of
lists, the approach taken by hqueue and dhqueue, means that in the worst case one will have to
search through all the items. This of course reduces performance, but seems necessary, in the
absence of non-busy-waiting semaphores, to efficiently implement a distributed cell or list based
queue.

An alternative solution to this problem is the technique of "temporary non-preemption"
discussed in [12], in which a process can be in either of two states, okpreempt or nopreempt.
Okpreempt is what was effectively used throughout these tests. Nopreempt is a temporary condition
that can be requested by the process so as to effectively eliminate the type of problem seen in queue.

We just saw that the k-ariness of the tree in tqueue should be chosen as a prime number near
the expected number of processes using the queue. This concept can most likely be extended to
the general class of parallel tree based algorithms as well.


Future  Work 

The research presented here suggests several avenues for future work in the areas of parallel
queues and their possible improvements. It would be nice to test more queuing algorithms when
they become available (i.e. are developed). Among the two hash queues there are small variants
regarding list manipulation and interior removal that could also be tested. For example, instead
of items being put on the head of a list, they could be put on the end. This way, the next item to
come off the queue wouldn't be pushed deeper into the list.

A more critical test of these queues might be performed on a machine that supports more
than eight processors, possibly 30 (Sequent/Balance) or 64 (IBM/RP3). A more accurate measure
of these algorithms' scalability could then be made on these machines. This, after all, is of critical
importance. Factors of five to ten in speed up are nice (Table 3), but can't justify the time and
effort being put into parallel systems today. If, however, the scalability is seemingly constant with
the number of processors, as with hqueue and the pools, then there is plenty of justification.

A related issue that must be resolved is the effect of PE contention. For example, it would
be useful to study how non-blocking synchronizations would affect the pathologic behavior of
queue and vqueue. Also, an intelligent scheduler that recognized critical processes could
be used, or the temporary non-preemption scheme covered in [12].

Another related topic is that of "barrier" effects as seen in /queue and tqueue in which a
group of processes is forced to proceed only as fast as the slowest process in the group. In some
situations this may not be avoidable, but there may be a way to reduce the effects. For instance,
the group lock family could include a routine that provides premature exit from a group.
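
To make the idea of premature exit concrete, here is a rough sketch built on an ordinary blocking barrier rather than on the busy-waiting group lock used in these tests. The class GroupBarrier and its method exit_group are hypothetical names, and the whole fragment is an assumption about how such a routine might behave, not a description of an existing primitive.

```python
import threading


class GroupBarrier:
    """Barrier whose members may leave the group early (illustrative only)."""

    def __init__(self, members):
        self._cond = threading.Condition()
        self._members = members      # processes still in the group
        self._arrived = 0            # members waiting in the current phase
        self._generation = 0         # incremented each time a phase completes

    def wait(self):
        """Block until every remaining member has arrived."""
        with self._cond:
            generation = self._generation
            self._arrived += 1
            if self._arrived >= self._members:
                self._release()
            else:
                while generation == self._generation:
                    self._cond.wait()

    def exit_group(self):
        """Leave the group so the others no longer wait for this member."""
        with self._cond:
            self._members -= 1
            # If everyone still in the group has already arrived, let them go.
            if self._members > 0 and self._arrived >= self._members:
                self._release()

    def _release(self):
        # Caller must hold self._cond.
        self._arrived = 0
        self._generation += 1
        self._cond.notify_all()
```

A slow process that has nothing further to contribute to the current phase would call exit_group() instead of wait(), so the remaining members proceed without it.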

As always, work on new synchronization algorithms should continue. Eric Freudenthal
at NYU's Ultracomputer group is currently working on a family of synchronizations that
he says can provide the functionality of reader-reader, reader-writer or group locks. These and
other algorithms (busy-waiting or blocking) should then be used in place of the synchronizations
used here in order to find the optimal implementation of each set of algorithms.

Appendix, Cqtst option flags.

The  following  are  the  option  flags  that  allow  almost  any  combination  of  the  options  above. 
When  illegal  combinations  are  specified,  such  as  multiple  item  insertion  on  sets,  the  user  is 
notified  and  the  options  are  automatically  adjusted. 

-q <Q> : Q determines queue type to be tested.

         0   No queuing done, base timing.

         Fifo
         1   fqueue       2   tqueue
         3   hqueue       4   mqueue
         5   queue        6   llist
         7   dhqueue      8   sllist
         9   vqueue

         Non-Fifo
         21  set
         22  cset
         23  pool

-m <M> : M is the number of queues of the specified type to use.

-n <N> : N is the number of distinct elements to be distributed among the M
         queues. Default = 1024.

-c <C> : C is a number such that the multiplicity of each of the N elements is
         randomly chosen to range from 1 to C. Default = 1.

-s <S> : S is the number of children to spawn during parallel processing. If
         loading and emptying are both done in parallel, the number of children
         used for each task is S/2. Default = 8.

-r <R> : R specifies the interior removal method to be used.

         1   Removal is done in place of emptying.

         2   Removal is done before any emptying is done. For this reason
             serial load is forced.

         Default = 0, no removal.

-l<c>  : Load queues serially or in parallel. c = s or p,
         s = serially, p = in parallel.

-e<c>  : Empty queues serially or in parallel. c = s or p,
         s = serially, p = in parallel.

-ec    : Turn on queue error checking. Default is none.

-t <T> : T is the number of cycles the queues are run through. The effect is
         to run cqtst T times. This is used mostly just to get better averages.
         Since the queue is not reinitialized for each cycle, T > 1 may force an
         adjustment of the queue(s).

-ps    : Queue exercising is to be done serially.

-pt    : Produce short form (terse) printout.

-p <P> : P is the number of delete/re-insert operations to be performed per
         queue per child.

-b     : Turn off overhead subtraction.

-h <H> : Valid only for hqueues. H is the size of the hash table to be used for
         the M queues. Default = 1031.
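
The interaction of the -n, -c and -m options can be illustrated with a short sketch. The function build_workload is hypothetical and is not taken from cqtst. The round-robin assignment of elements to queues and the default of one queue are assumptions, since the option descriptions above do not say how the N elements are spread over the M queues.

```python
import random


def build_workload(n=1024, c=1, m=1, seed=None):
    """Return m lists holding n distinct elements, each repeated 1..c times.

    Assumptions: elements are the integers 0..n-1 and element i goes to
    queue i % m (round robin). The real distribution rule may differ.
    """
    rng = random.Random(seed)
    queues = [[] for _ in range(m)]
    for element in range(n):
        copies = rng.randint(1, c)       # multiplicity in 1..c, inclusive
        for _ in range(copies):
            queues[element % m].append(element)
    return queues
```

With the default C = 1 every element appears exactly once, so the total number of insertions equals N. Larger values of C raise the total to somewhere between N and N*C.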